Json: Soften the landing when dumping non-UTF8 strings (type_error.316 exception)

Created on 17 Aug 2018  ·  67 Comments  ·  Source: nlohmann/json

When dumping a json object that contains a std::string, the type_error.316 exception may be thrown if the string contains invalid UTF8 sequences. While this sounds like a reasonable thing to do, it actually causes more problems than it fixes. In all cases where user-entered (unsafe) strings end up stored in nlohmann::json objects, the developer now has to "do something" before assigning a string value to a nlohmann::json variable. This effectively renders the whole string-assignment functionality unsafe and defeats its purpose.

Below is the wrapper code I had to write in order to investigate the random crashes my application went through due to the 316 exception.

// Declaration:
std::string dump_json(const nlohmann::json &json, const int indent = -1, const char* file = __builtin_FILE(), int line = __builtin_LINE()) const;

// Definition
std::string MANAGER::dump_json(const nlohmann::json &json, const int indent, const char* file, int line) const {
    std::string result;
    try {
        result = json.dump(indent);
    }
    catch (nlohmann::json::type_error& e) {
        bug("%s: %s (%s:%d)", __FUNCTION__, e.what(), file, line);
    }
    return result;
}

This led me to discover the code in my application that was sending json-formatted log events to the log server. The log event has to store user-entered data, and I would expect the dump function to deal with invalid UTF8 sequences.

If I have to use my json dump wrapper everywhere in my application code then of what use is the original dump function to begin with? What is more, I'd actually have to enhance the wrapper function to iterate over all strings stored in the json object and do _something_ about the possible invalid UTF8 sequences. Not very convenient.

Therefore, I would propose that the default behavior of the dump function be to simply ignore (skip or replace) invalid UTF8 sequences and NOT throw. Or perhaps add a nothrow parameter to the string assignment operator=.
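To make the "replace" variant concrete, here is a minimal standalone sketch (the function name and the per-byte replacement granularity are my own choices, not anything the library provides): every byte that is not part of a well-formed UTF8 sequence is substituted with U+FFFD before the string ever reaches the json object.

```cpp
#include <cstddef>
#include <string>

// Sketch only: replace each byte that is not part of a well-formed UTF-8
// sequence with U+FFFD. Overlong and surrogate encodings are rejected; the
// per-byte replacement granularity is a simplification.
std::string replace_invalid_utf8(const std::string& in) {
    static const std::string kReplacement = "\xEF\xBF\xBD"; // U+FFFD
    std::string out;
    const unsigned char* p = reinterpret_cast<const unsigned char*>(in.data());
    std::size_t i = 0, n = in.size();
    while (i < n) {
        unsigned char c = p[i];
        std::size_t extra;
        unsigned char lo = 0x80, hi = 0xBF; // allowed range of 1st continuation
        if (c < 0x80) { out += static_cast<char>(c); ++i; continue; }
        else if (c < 0xC2) { out += kReplacement; ++i; continue; } // stray/overlong
        else if (c < 0xE0) extra = 1;
        else if (c < 0xF0) { extra = 2; if (c == 0xE0) lo = 0xA0; if (c == 0xED) hi = 0x9F; }
        else if (c < 0xF5) { extra = 3; if (c == 0xF0) lo = 0x90; if (c == 0xF4) hi = 0x8F; }
        else { out += kReplacement; ++i; continue; }               // > U+10FFFF
        bool ok = (i + extra < n) && p[i + 1] >= lo && p[i + 1] <= hi;
        for (std::size_t k = 2; ok && k <= extra; ++k)
            ok = (p[i + k] & 0xC0) == 0x80;                        // continuation byte?
        if (!ok) { out += kReplacement; ++i; continue; }           // broken sequence
        out.append(in, i, extra + 1);                              // copy valid sequence
        i += extra + 1;
    }
    return out;
}
```

With a helper like this, `nlohmann::json j = replace_invalid_utf8(user_input);` could never trip type_error.316, at the cost of silently altering the data.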

If things stay as they are, I probably won't be the last developer to assume that the dump method is safe to use. After all, if the lib allows me to assign a value to the json object, then how can it be that it lets me assign values that later invalidate the whole json? This is illogical.

// nothrow assignment would look like this
nlohmann::json j_nothrow = (std::nothrow) "\xF0\xA4\xAD\xC0"; // Does not throw.
// I am not 100% sure if this can be done, but since the `new` operator can have such
// a parameter I would suppose the assignment operator = could also be customized
// like that.

std::string dumped = j_nothrow.dump(); // Never throws, just works.

nlohmann::json j_throw;
try {
    j_throw = "\xF0\xA4\xAD\xC0"; // Throws immediately.
}
catch (nlohmann::json::type_error& e) {
    // handle the invalid UTF8 string
}

One more thing: why not include the file and line from where the exception originated in the exception description? You can see in my wrapper function above how I am using __builtin_LINE() and __builtin_FILE() to store the file and line information from where the wrapper call was made. It is super useful when debugging, and an exception description is all about debugging.


Most helpful comment

I think it is out of scope of the library to fix invalid UTF-8. I further do not think that silently fixing invalid inputs makes the world a better place, but rather rejecting such inputs with as much noise and as early as possible.

All 67 comments

I have slapped together a general purpose utf8 trimming function that you may find useful.

It takes the unsafe std::string as an argument and removes all invalid UTF8 sequences from it.

#include <string>
#include <vector>

void trim_utf8(std::string& hairy) {
    std::vector<bool> results;
    std::string smooth;
    size_t len = hairy.size();
    results.reserve(len);
    smooth.reserve(len);
    const unsigned char *bytes = (const unsigned char *) hairy.c_str();

    auto read_utf8 = [](const unsigned char *bytes, size_t len, size_t *pos) -> unsigned {
        int code_unit1 = 0;
        int code_unit2, code_unit3, code_unit4;

        if (*pos >= len) goto ERROR1;
        code_unit1 = bytes[(*pos)++];

             if (code_unit1 < 0x80) return code_unit1;
        else if (code_unit1 < 0xC2) goto ERROR1; // continuation or overlong 2-byte sequence
        else if (code_unit1 < 0xE0) {
            if (*pos >= len) goto ERROR1;
            code_unit2 = bytes[(*pos)++]; //2-byte sequence
            if ((code_unit2 & 0xC0) != 0x80) goto ERROR2;
            return (code_unit1 << 6) + code_unit2 - 0x3080;
        }
        else if (code_unit1 < 0xF0) {
            if (*pos >= len) goto ERROR1;
            code_unit2 = bytes[(*pos)++]; // 3-byte sequence
            if ((code_unit2 & 0xC0) != 0x80) goto ERROR2;
            if (code_unit1 == 0xE0 && code_unit2 < 0xA0) goto ERROR2; // overlong
            if (*pos >= len) goto ERROR2;
            code_unit3 = bytes[(*pos)++];
            if ((code_unit3 & 0xC0) != 0x80) goto ERROR3;
            return (code_unit1 << 12) + (code_unit2 << 6) + code_unit3 - 0xE2080;
        }
        else if (code_unit1 < 0xF5) {
            if (*pos >= len) goto ERROR1;
            code_unit2 = bytes[(*pos)++]; // 4-byte sequence
            if ((code_unit2 & 0xC0) != 0x80) goto ERROR2;
            if (code_unit1 == 0xF0 && code_unit2 <  0x90) goto ERROR2; // overlong
            if (code_unit1 == 0xF4 && code_unit2 >= 0x90) goto ERROR2; // > U+10FFFF
            if (*pos >= len) goto ERROR2;
            code_unit3 = bytes[(*pos)++];
            if ((code_unit3 & 0xC0) != 0x80) goto ERROR3;
            if (*pos >= len) goto ERROR3;
            code_unit4 = bytes[(*pos)++];
            if ((code_unit4 & 0xC0) != 0x80) goto ERROR4;
            return (code_unit1 << 18) + (code_unit2 << 12) + (code_unit3 << 6) + code_unit4 - 0x3C82080;
        }
        else goto ERROR1; // > U+10FFFF

        ERROR4:
        (*pos)--;
        ERROR3:
        (*pos)--;
        ERROR2:
        (*pos)--;
        ERROR1:
        return code_unit1 + 0xDC00;
    };

    unsigned c;
    size_t pos = 0;
    size_t pos_before;
    size_t inc = 0;
    bool valid;

    for (;;) {
        pos_before = pos;
        c = read_utf8(bytes, len, &pos);
        inc = pos - pos_before;
        if (!inc) break; // End of string reached.

        valid = false;

        if ( (                 c <= 0x00007F)
        ||   (c >= 0x000080 && c <= 0x0007FF)
        ||   (c >= 0x000800 && c <= 0x000FFF)
        ||   (c >= 0x001000 && c <= 0x00CFFF)
        ||   (c >= 0x00D000 && c <= 0x00D7FF)
        ||   (c >= 0x00E000 && c <= 0x00FFFF)
        ||   (c >= 0x010000 && c <= 0x03FFFF)
        ||   (c >= 0x040000 && c <= 0x0FFFFF)
        ||   (c >= 0x100000 && c <= 0x10FFFF) ) valid = true;

        if (c >= 0xDC00 && c <= 0xDCFF) {
            valid = false;
        }

        do results.push_back(valid); while (--inc);
    }

    size_t sz = results.size();
    for (size_t i = 0; i < sz; ++i) {
        if (results[i]) smooth.append(1, hairy.at(i));
    }

    hairy.swap(smooth);
}

Thanks! I'll check. I already have some code to translate from UTF-16 and UTF-32 to UTF-8 to parse from wide strings. But this would only help if your string is properly encoded with these encodings.

Yeah, I can't agree that the correct solution is to remove the invalid UTF-8. And I'm also strongly against the practice of "rolling your own unicode conversion routines."

If the unicode consortium had to withdraw the much-used source-code that forms the basis of 80% of (the better) conversions due to it being wildly inaccurate with respect to today's unicode, I dread the corner cases that would arise from something you wrote yourself.

The only library I trust is boost::nowide (available as a separate, non-boost package), and that does not handle UTF-32.

see: http://unicode.org/forum/viewtopic.php?f=9&t=90 for more information on the ConvertUTF8.h library and its withdrawal

... excerpt from http://www.unicode.org/versions/Unicode11.0.0/ch03.pdf#G7404 (page 122)

The UTF-8 code unit sequence <41 C3 B1 42> is well-formed, because it can be
partitioned into subsequences, all of which match the specification for UTF-8
in Table 3-7. It consists of the following minimal well-formed code unit subsequences:
<41>, <C3 B1>, and <42>.

• The UTF-8 code unit sequence <41 C2 C3 B1 42> is ill-formed, because it contains
one ill-formed subsequence. There is no subsequence for the C2 byte
which matches the specification for UTF-8 in Table 3-7.

See page 126 (4 pages down) for the tables in question.

boost::nowide appears to be capable of lossless conversion of invalid code sequences between UTF-8 and UTF-16. Though as mentioned, it has no UTF-32 support. This may not be an issue, since simple iteration through the std::wstring UTF-16 version should allow one to distinguish between valid and invalid UTF, and encode appropriately.

At this point, I'd be much more concerned with the requirement to have an additional library (even without boost, it's still a few files - though purely .hpp).

I think this falls under "author's discretion."

#include <cstdio>
#include <cstdlib>
#include <string>
#include <nowide/convert.hpp>

using namespace nowide;

int main(int argc, const char **argv) {
    std::string invalid("\u0006\u00e4\u00b4\u00d8\u008d\"");

    auto _wide = widen(invalid);
    auto _narrow = narrow(_wide);

    printf("%s\n", invalid.c_str());                  fflush(stdout);
    printf("%s\n", _narrow.c_str());                  fflush(stdout);
    printf("%s\n", invalid == _narrow ? "==" : "!="); fflush(stdout);
    printf("test complete.\n");                       fflush(stdout);

    printf("narrow was %zu characters, wide was %zu characters.\n",
            _narrow.length(), _wide.length());
    exit(0);
}

/*
ä´Ø"
ä´Ø"
==
test complete.
narrow was 10 characters, wide was 6 characters.
*/

https://github.com/artyom-beilis/nowide

Right now, the library only supports UTF-8. (It recently added support to parse from wide strings, but internally, everything is stored as UTF-8). We had the issue that invalid UTF-8 was stored in the library and dumped without notice, and the issue was only apparent when this value was parsed again. Therefore, we added an exception when dumping invalid UTF-8. One may argue that it would be even better to throw when invalid UTF-8 is stored in the first place.

I am not sure how nowide would help here.

@nlohmann exactly! thanks for at least getting my point. I would definitely argue that if invalid utf8 will throw then it should throw during assignment not during dumping.

@sfinktah If you didn't know, JSON has no formal meaning for non-utf8 byte sequences. It's as if JSON were color-blind to a certain group of byte sequences. Do color-blind people get thrown an exception every time they look at something that is either red or green? NO. It would be silly. Do you suffer a seizure when infrared light gets into your eyes? Think of nlohmann's json library the same way. Its main purpose is to make the developer's life easier, not harder. Stripping out invalid UTF8 sequences is the way to go. I already proposed a std::nothrow version for the assignment, so if you love your exceptions so much you'd have to do nothing, as the proposed solution would be opt-in. I would advise you to rethink your position.

@1Hyena Traffic lights are arranged in a specific color sequence, precisely so that people who suffer red/green color blindness, can determine what "color" a light is. In fact, I believe the positioning/order of colors is actually a world-wide standard.

JSON also has no formal definition for capital letters, French, or spacing between words.

@nlohmann I've completed my proof of concept, showing how nowide can be of use. Please do not confuse this with a recommendation that you use nowide. It was just a trusted tool that I had to hand.

You may check out the function proof-of-concept at https://github.com/sfinktah/utftest/

It is coded a bit absurdly, but it proves a point. Namely, that an input of

std::string invalid("^F     └─ä┘´Ø<8d>");

can be encoded without any loss, to

\u0006\t\t\u2514\u2500\u00e4\u2518\u00b4\u00d8\u008d\"

Using (purely as an example) the boost::nowide library as a filter for valid UTF-8 sequences.

And, FTR, I'm a subscriber of the UTF-8 Everywhere manifesto. I despise UTF-16.

    // g++ -I. -std=c++1z testutf2.cpp -o testutf2 && ./testutf2
    #include <algorithm>
    #include <array>
    #include <cstdio>
    #include <cstdlib>
    #include <nowide/convert.hpp>
    #include <string>
    #include <vector>

    using namespace std::string_literals;
    using namespace nowide;

    std::string escape_json_char(int c) {
        using T                 = char;
        static auto escape_code = [](auto d) { return "\\"s + (T)d; };

        static auto unicode_escape = [](auto d) {
            static const std::array<char, 16> hexify = {
                {'0', '1', '2', '3', '4', '5', '6', '7', '8', '9', 'a', 'b', 'c', 'd', 'e', 'f'}
            };
            auto esc = escape_code('u');

            // this will spaz on numbers greater than 0xffff
            for (int i = 0; i < 4; ++i)
                esc += (T)hexify[(d & 0xf000) >> 12], d <<= 4;

            return esc;
        };

        switch (c) {
            case '\b': return escape_code('b');  // 0x08
            case '\t': return escape_code('t');  // 0x09
            case '\n': return escape_code('n');  // 0x0a
            case '\r': return escape_code('r');  // 0x0d
            case '\\': return escape_code('\\');
            case '"':  return escape_code('"');
        }

        if (c >= 32 && c <= 126)
            return std::basic_string<T>(1, (T)c);

        return unicode_escape(c);
    }

    // define the string overload first so the pointer overload can find it
    template <class T> auto escape_json(std::basic_string<T> s) {
        std::vector<std::string> escaped;
        std::transform(s.begin(), s.end(), std::back_inserter(escaped), escape_json_char);
        return escaped;
    }
    template <class T> auto escape_json(T const *p) { return escape_json(std::basic_string<T>(p)); }

    int main() {
        std::string invalid("\u0006\t\t\u2514\u2500\u00e4\u2518\u00b4\u00d8\u008d\"");

        auto _wide   = widen(invalid);
        auto _narrow = narrow(_wide);

        printf("%s\n", invalid.c_str()); fflush(stdout);
        printf("%s\n", _narrow.c_str()); fflush(stdout);
        printf("%s\n", invalid == _narrow ? "==" : "!="); fflush(stdout);

        printf("narrow was %zu characters, wide was %zu characters.\n", _narrow.length(), _wide.length());

        auto escaped = escape_json(_wide);

        for (const auto& e : escaped)
            printf("%s", e.c_str());

        printf("\n\ntest complete.\n"); fflush(stdout);
        exit(0);
    }

    /*
           └─ä┘´Ø"
           └─ä┘´Ø"
    ==
    narrow was 21 characters, wide was 11 characters.
    \u0006\t\t\u2514\u2500\u00e4\u2518\u00b4\u00d8\u008d\"

    test complete.
    */
    /* vim: set ts=4 sts=4 sw=4 et: */

@sfinktah the traffic lights comparison does not make any sense. I am asking you a simple question. Your data format is JSON and you are given a string that contains byte sequences that can't be expressed in JSON. What are you going to do? You have to do something; you just can't shut down the application. You also can't ignore the input, because you need as much information as possible. You can't reject the input either. How are you going to solve this problem in the REAL WORLD, which is often dirty and not perfect? You come across as someone living in a glass castle on a puffy cloud. Sure, it's easy to whine and moan about abstract ideals like a dogmatic zealot of some particular programming religion. But in the real world, problems have to be solved in the best possible way. Sacrifices and compromises are a normal thing in the real world. And I see nlohmann::json as a library for the real world, not for a mathematical utopia. So think again: is it really a bad idea to discard the byte sequences that couldn't ever be expressed in JSON anyway, especially when the programmer has deliberately indicated that it is okay by calling the nothrow version of string assignment?

As for UTF16 and anything other, I agree with you. UTF8 is the only sensible character encoding and anything else is a cancer. Call me Hitler I don't care. I am proudly a UTF8 supremacist.

For what it's worth: I think filtering invalid UTF-8 sequences is not a good idea.

@nlohmann any better alternative?

if you're building an application that must overcome an exceptional situation and still give the best possible performance then I would say this is the only ideal solution.

I think it is out of scope of the library to fix invalid UTF-8. I further do not think that silently fixing invalid inputs makes the world a better place, but rather rejecting such inputs with as much noise and as early as possible.

@nlohmann
you want the devs to litter their codebase with try/catch blocks around each string assignment just because you think utf8 trimming is out of scope? it's enough of a headache that you rely on exceptions as much as you already do, not to mention that the namespace nlohmann:: is tediously long.

I understand your frustration that json isn't compatible with non-utf8 strings; this indeed causes a lot of mess, but we have to deal with this mess in the best possible way. the header is already a giant, adding one more function to it won't be much of a problem, but it saves a ton of lines of code for the devs using your lib.

Besides, with my proposed solution you can get rid of this exception once and for all. When assigning a non-utf8 string to the json object, the contained string will be shorter than the string that was originally assigned. it makes it possible for the dev to quickly check whether the original string contained any invalid utf8 sequences. You already have utf related code in the header anyway, so "out of scope" doesn't really apply here.

std::string invalid_utf8 = "\xF0\xA4\xAD\xC0";
nlohmann::json json = invalid_utf8; // This would work and would never throw.

// Oh but I want to know if the input was invalid utf8? What should I do?
if (json.get<std::string>().length() != invalid_utf8.length()) {
    // we have a problem
}

In this situation, if you rely on try/catch blocks, every string assignment has to be surrounded with them, littering the code and making it ugly with no benefit. If I have to write extra code EITHER WAY then I might as well trim the strings myself. So no one would ever actually need your exception to begin with. I hope this finally explains my reasoning, because it seems as if we have a situation where some of us have developed a Stockholm syndrome regarding exceptions.

Don't just sweep the problem under the carpet, solve it. Assume that the devs using this lib are not going to surround each string assignment with try/catch blocks no matter how much you would like it. And you already agreed that throwing from dumping is an even worse fallacy than throwing from string assignments, so it pretty much leaves you with no other sane solutions than to trim the utf8. Prove me wrong or admit your bias.

Listen, I do not need to prove anyone wrong or admit anything.

@1Hyena or you could just do this, which also avoids the exceptions:

std::string invalid_utf8 = "\xF0\xA4\xAD\xC0";
nlohmann::json json = fix_my_invalid_string(invalid_utf8); // This would work and would never throw.

I think it's perfectly reasonable for this library to assume the strings it gets are valid (i.e., perfect UTF-8).

Imagine the chaos if programmers claimed that strdup should fix UTF-8 for you, or that printf and fopen and std::regex_match and all other library functions needed to become encoding-aware and must also sanitize their string-inputs better. No, on the contrary, each library function has a precondition that the programmer must meet, and this library says that "strings are valid UTF-8" -- which is perfectly reasonable.

Sure, this library is about text parsing and needs to be aligned towards text-processing things, but that doesn't mean this library should be burdened with a grab-bag of Unicode transcoding functions. There are libraries whose sole purpose is do exactly that, and they'll do it far better, far faster, and with far fewer bugs than if this library took those on as a nice-to-have side-project.

If the programmer is unsure whether a string may or may not be valid UTF-8 at any given point in the program, the onus is on them to make sure their code translates them to "valid" inputs for whatever functions they call (and gregmarr provides a great example for how to do that in his comment).

@gregmarr yes that would work, although it would be only slightly better/prettier than surrounding each string assignment with try/catch blocks.

@jaredgrubb contrary to @nlohmann 's response, I respect yours because you actually bothered to explain. However, I still cannot completely agree with you. strdup and the other comparisons you brought up do not need to know about utf8, but json specifically says it can only store utf8 strings.

do you know what overexposure is in photography? well, it gets handled the same way as trimming utf8 would be.

if you want to insist that utf8 related code has no place in this lib then go to its source code and search for the word utf. IT IS FULL OF IT. So it is annoyingly hypocritical to see some of the participants here talking that stripping out utf8 bytes does not belong into the context while the context is full of utf8 code already.

If the programmer is unsure whether a string may or may not be valid UTF-8 at any given point in the program, the onus is on them to make sure their code translates them to "valid" inputs for whatever functions they call (and gregmarr provides a great example for how to do that in his comment).

All I'm trying to say is that the current implementation of this json lib is fertile ground for shooting yourself in the foot, and I was hoping to find a way to avoid it. I like your reasoning though; @nlohmann should try to learn from it. I think this situation is similar to escaping in general, for example to avoid SQL injections. Should the lib escape the potentially harmful string for the dev or not? It depends, but most of the time such escapes are not made automatically, and in that light I tend to agree with you. If I am absolutely 100% sure that my strings are always utf8, then of course such trimming would be pointless overhead.

Yeah, in the end, after giving this a second thought, I think the correct approach would be to throw from string assignments but never from dump(), but the named utf8 trimming function could still be useful for this lib. At least I would find it useful :P

@jaredgrubb contrary to @nlohmann 's response I respect yours for the reason that you actually bothered to explain.
I like your reasoning though, @nlohmann should try to learn from it.

You do realize that this is his project, and he's free to do with it as he wishes, right? For you to come in here and insist that he must "prove" or "admit" or "learn" anything just because you say so is extremely rude.

@gregmarr yes that would work, although it would be only slightly better/prettier than surrounding each string assignment with try/catch blocks.

It's part of sanitizing user input. The user input you're getting may be improper in any number of ways. It's up to you to make sure that it is properly formed before passing it along to any other tool.

There could be any number of ways for the library to respond to invalid strings, and not one is proper for every use case. Your application is the best one to know whether an invalid string should result in an error being returned to the user, or a warning being returned to the user with a best effort result, or no error being returned to the user with the invalid portions just removed in an arbitrary fashion.

if you want to insist that utf8 related code has no place in this lib then go to its source code and search for the word utf. IT IS FULL OF IT. So it is annoyingly hypocritical to see some of the participants here talking that stripping out utf8 bytes does not belong into the context while the context is full of utf8 code already.

That's not what was said. What was said was that code which takes things which are supposed to be valid UTF-8 but are not, and somehow fixing them to be UTF-8, such as by silently dropping those bytes, does not belong here.

@gregmarr you do realize that string assignments in their current form are dangerous? sure, build a big application using this lib under the assumption that assignments are always valid utf8, and then one day you change some unrelated code and everything breaks because now all of a sudden some string may contain invalid bytes. So yes, I know why you are like that, but do you know what I am saying? It's easy to regurgitate your point, but I can see it's not so easy to listen to what the other person wants to say. And speaking of @nlohmann, he's the one who was rude to begin with; just because it's his project doesn't mean his shit doesn't stink. These days we can always fork, but is this really the solution you want to propagate? A fork should be the last resort, which will unfortunately happen more often if one of the parties is arrogant enough to not discuss things. And of course, speaking of egomania, who names their lib after themselves? Not only is it long and tedious to write, it's also pathetic.

@1Hyena Could you please point out where I was "rude to begin with"? I would also like to remind you of the Code of Conduct of this project.

I try to summarize some options we have in this situation. Not all of them were mentioned, but I try to get a picture of all possible ways to proceed:

  • Assignment of invalid Unicode:

    • Option 1: The library should "fix" invalid Unicode input (e.g. by removing invalid parts) and do not throw.
    • Option 2: The library should throw an exception.
    • Option 3: The library should just assign the string (this is the status quo).
  • Serialization of previously stored invalid Unicode:

    • Option 1: This should be made impossible, e.g. by options 1 or 2 above.
    • Option 2: The library should dump the invalid Unicode as is.
    • Option 3: The library should "fix" the invalid Unicode (e.g. by not dumping invalid parts) and proceed.
    • Option 4: The library should throw an exception (this is the status quo).

Did I miss anything?

you do realize that string assignments in its current form are dangerous?

String assignments with unknown, unvalidated contents, yes.

And speaking of @nlohmann he's the one who was rude to begin with,

Uh, no:

You are appearing as someone living in a glass castle on a puffy cloud. Sure, it's easy to whine and moan about abstract ideals like a dogmatic zealot of some particular programming religion.
not to mention that the namespace nlohmann:: is tediously long.
Don't just sweep the problem under the carpet, solve it.
it pretty much leaves you with no other sane solutions than to trim the utf8
Prove me wrong or admit your bias.

A fork should be the last resort, which will unfortunately happen more often if one of the parties is arrogant enough to not discuss things.

There's been plenty of discussion here.

And of course speaking of egomania, who names their lib after themselves? Not only it is long and tedious to write it, it's also pathetic.

It's a fairly common practice for personal projects to use the github username to avoid namespace conflicts.
If you don't want to write it, then C++ has this wonderful feature called using which will get rid of it for you.

Option 2: The library should dump the invalid Unicode as is.

For completeness sake, pointing out that this is the previous behavior, which people didn't like because it meant that the library could create a string that it couldn't parse.

If I am absolutely 100% sure that my strings are always utf8 then of course such trimming would be pointless overhead.

This is a core idea of input sanitization: You should know 100% whether your strings are valid or not. The areas of your code where you allow uncertainty should be at the "edge" and isolated to as small a location as possible so that the rest of your code can be simple and assume validity. If you need to propagate "maybe-valid" strings deeper into a program, then create a class whose sole purpose is to hold them (such that you get type safety and cannot accidentally send possibly-invalid strings into API that require valid ones; you can do the same thing with sensitive strings like passwords). That class can provide unbox methods to turn the "maybe-valid" into valid strings using whatever error recovery/substitution is appropriate for the use-case.
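A minimal sketch of such a "maybe-valid" box (names like MaybeUtf8, require_valid, and value_or are hypothetical, and the validator is injected so any UTF-8 checker can be plugged in):

```cpp
#include <functional>
#include <stdexcept>
#include <string>
#include <utility>

// A "maybe-valid" string box: it holds bytes of unknown encoding and only
// releases them through an explicit unboxing step, so functions that require
// valid UTF-8 can take a plain std::string and never see unchecked input.
class MaybeUtf8 {
public:
    explicit MaybeUtf8(std::string raw) : raw_(std::move(raw)) {}

    // Unbox, throwing if the bytes fail the supplied validator.
    std::string require_valid(const std::function<bool(const std::string&)>& is_valid) const {
        if (!is_valid(raw_)) throw std::invalid_argument("not valid UTF-8");
        return raw_;
    }

    // Unbox with a fallback value when validation fails.
    std::string value_or(const std::function<bool(const std::string&)>& is_valid,
                         std::string fallback) const {
        return is_valid(raw_) ? raw_ : std::move(fallback);
    }

private:
    std::string raw_;  // unvalidated bytes, never exposed directly
};
```

The error recovery (throw vs. fallback) is chosen at the unboxing site, which is exactly where the use-case is known.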

There is no one obvious way to sanitize strings and encodings (as nlohmann is showing by enumerating all the options). It's up to the client to do that validation before they attempt to use this library.

Also, please be more civil in the discourse. It's ok to disagree, argue, and storm out and fork your own vision. But @nlohmann has been a great project maintainer (with thicker skin than I would have in his shoes!). Personal attacks are not helpful and have no place in GitHub. If this entreaty doesn't work, just remember that these comments are forever and a future employer may stumble on them one day -- put your best image forward!

Oh dear.

cough... The Unicode Standard (http://www.unicode.org/versions/Unicode11.0.0/) expressly forbids the fixing or repair of broken Unicode; there are significant security issues with doing so.

One may take the view that removing invalid portions of an input string is "repairing." There are certainly still security issues involved. E.g., you filter out files with an extension of "exe"; someone packages a file called "trojan.ex\xd8\xd8e"; it passes the filter, but by the time it's recorded in the database, the file has been silently altered to "trojan.exe".
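That failure mode is easy to demonstrate in a few lines. Below, drop_invalid_bytes and blocked are hypothetical stand-ins (nothing this library does): a sanitizer that silently drops ill-formed bytes manufactures exactly the filename an extension blacklist was meant to stop.

```cpp
#include <cstddef>
#include <string>

// Hypothetical naive sanitizer: silently drops any byte that is not part of a
// structurally well-formed UTF-8 sequence (overlong/surrogate checks omitted
// for brevity).
std::string drop_invalid_bytes(const std::string& in) {
    std::string out;
    const unsigned char* p = reinterpret_cast<const unsigned char*>(in.data());
    std::size_t i = 0, n = in.size();
    while (i < n) {
        unsigned char c = p[i];
        std::size_t extra;
        if (c < 0x80) extra = 0;
        else if (c >= 0xC2 && c < 0xE0) extra = 1;
        else if (c >= 0xE0 && c < 0xF0) extra = 2;
        else if (c >= 0xF0 && c < 0xF5) extra = 3;
        else { ++i; continue; }                     // invalid lead byte: drop it
        bool ok = (i + extra < n);
        for (std::size_t k = 1; ok && k <= extra; ++k)
            ok = (p[i + k] & 0xC0) == 0x80;         // continuation bytes present?
        if (!ok) { ++i; continue; }                 // broken sequence: drop lead byte
        out.append(in, i, extra + 1);
        i += extra + 1;
    }
    return out;
}

// A blacklist that rejects names ending in ".exe".
bool blocked(const std::string& name) {
    return name.size() >= 4 && name.compare(name.size() - 4, 4, ".exe") == 0;
}
```

Before sanitization, "trojan.ex\xd8\xd8e" does not end in ".exe" and passes the filter; after sanitization it does. Lossless escaping (or outright rejection) avoids this class of bug.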

@nlohmann Yes, I believe you did miss something. Or mislabeled it...

My provided POC encoded binary strings that were UTF-8 invalid. It did not fix the UTF-8. It merely encoded the non-UTF-8 values as individual bytes.

The resultant decode would prove identical to the broken input, therefore nothing has been repaired.

It has simply been encoded correctly (if you accept "the way python -mjson.tool would encode it" as the definition of correct.)

The fact that the encoded result will decode without modification to the existing decoder is both a bonus, and supporting evidence that this is indeed the correct way to encode non-conformant strings.

@1Hyena if my JSON library couldn't support encoding the strings I was using, I would use another library or encoding method.

P.S. did nobody notice that I wrote a spec-compliant, fully functional, JSON string encoder in less than 38 lines of code? _n.b.: actual function may not be fully functional, spec-compliant, and should never be used in any code, ever._

If design simplicity has any correlation with correctness (and I'm fairly sure it doesn't) ...

Though I would never advise "do what rapidjson does", I was curious, so I quickly compiled up the smallest example program "pretty.cpp" and ran my encoded yet invalid UTF-8 string into it.

    rapidjson $ cat test.json
    {
        "test": "\u0006\t\t\u2514\u2500\u00e4\u2518\u00b4\u00d8\u008d\""
    }

    rapidjson $ cat test.json | ./pretty
    {
        "test": "\u0006\t\t└─ä┘´Ø\""
    }

Since the output was binary, I re-ran it through the test program to ensure it could already read the binary version, then through python to ensure the result was the same as the original.

    rapidjson $ cat test.json | ./pretty | ./pretty | python -mjson.tool
    {
        "test": "\u0006\t\t\u2514\u2500\u00e4\u2518\u00b4\u00d8\u008d\""
    }

_n.b. the example program specifically states:_

// This example can only handle UTF-8. For handling other encodings, see prettyauto example.

And just to re-assure everyone that all input and output was pure 8-bit binary, without any encoding (other than implied UTF-8):

00000000: 7b0a 2020 2020 2274 6573 7422 3a20 225c  {.    "test": "\
00000010: 7530 3030 365c 745c 74e2 9494 e294 80c3  u0006\t\t.......
00000020: a4e2 9498 c2b4 c398 c28d 5c22 220a 7d    ..........\"".}

It's possibly worth noting at this point, that my terminal software (mintty) also has to differentiate between valid and invalid UTF-8 in order to display the results of these tests.

... in contrast to the last example, here is an example with rapidjson and a "\uDEAD"
(a single unpaired UTF-16 surrogate). This is a somewhat different kettle of fish, as nlohmann::json couldn't generate this; however, it is specifically mentioned in the RFC...

    rapidjson $ cat invalid.json
    {"invalid":"\uDEAD"}

    rapidjson $ cat invalid.json | ./pretty.exe
    {
        "invalid": "▒"
    }

    rapidjson $ cat invalid.json | ./pretty.exe | python -mjson.tool
    {
        "invalid": "\udead"
    }

    rapidjson $ cat invalid.json | ./pretty.exe | ./pretty.exe

    Error(18): Invalid encoding in string.

From the RFC:

[...] The behavior of software that
receives JSON texts containing such values is unpredictable; for
example, implementations might return different values for the length
of a string value or even suffer fatal runtime exceptions.

The above should not be misread as an excuse for not properly and fully encoding data. Ultimately, I believe that the following paragraph summarizes things nicely.

When all the strings represented in a JSON text are composed entirely
of Unicode characters [UNICODE] (however escaped), then that JSON
text is interoperable in the sense that all software implementations
that parse it will agree on the contents of names and of string
values in objects and arrays.

If there is still some question about which path to take, Douglas Crockford (author of the _now superseded_ RFC) is available on GitHub. However, he has been known to get very grumpy with people who ask questions for which he considers the answers to be self-evident.

@nlohmann listen, I am talking to you as a friend. I find your lib very useful, and if I didn't want to contribute positively, I wouldn't have created this issue, which, you have to admit, highlights a controversial and important topic. I got pissed off when I noticed that people are not thinking with me when reading what I am writing. Yes, this is typical of many people, but it is also arrogant and carries the respective implications. I can't blame you for being who you are, but I expect you to then at least accept the consequences. I came here to help you. By not listening to what I had to say, things got emotional. If you disagree, then at least explain, as that would be the respectful thing to do. Authority means nothing to me, and if you or anyone here feels that I am not treating @nlohmann with some special respect, then tough luck; not everyone in this world is brainwashed into appealing to authority or desperately seeking the acceptance of others. To me, you are equal the same way my friends are equal to me. I am not afraid to point out their flaws, but I will also not hold a personal grudge over a disagreement, and most importantly, I am willing to re-think my position and to change my mind if necessary.

That said, back to the topic. You missed out one very important option:

  • Assignment of invalid Unicode:

    • Option 1: The library should ignore invalid Unicode input (e.g. by removing invalid parts) and not throw.
    • Option 2: The library should throw an exception.
    • Option 3: The library should just assign the string (this is the status quo).
    • Option 4: Options 1 and 2 together, making it possible for the user/dev to decide the behavior (see the code below).

nlohmann::json j_nothrow = (std::nothrow) "\xF0\xA4\xAD\xC0"; // Does not throw.
// The assignment operator would take an optional (std::nothrow) parameter.
// When it is present, the lib should IGNORE invalid UTF8 sequences.

For the record, I never said that invalid UTF8 should be fixed (I "fixed" that in your list of options). I said that invalid UTF8 sequences should be IGNORED. There's a big difference. And if this behavior is opt-in via the (std::nothrow) parameter then @sfinktah 's concerns about the file extension example would be neutralized.

@gregmarr
The using namespace mechanism is a horrible creation. No one should ever use it. It only creates a mess.

@sfinktah cough, oh dear, there are also significant security issues related to not surrounding code that may throw with try-catch blocks. You are completely missing the point, probably deliberately, as far as I can tell from your tone. Also you're full of hypocrisy, as usual. You fail to see that nlohmann already has code to validate UTF8. If it can be used for validation, then it can be trivially modified to be useful for trimming. Your initial paranoia about custom UTF8 code really doesn't fit into this context, because as you can see, the lib is already full of it. Already!

Back to the point...

The first rule of exceptions is that the use of exceptions itself should be exceptional. If your code throws, then the responsible thing to do would be to surround it with the needed try/catch blocks. Does anyone here honestly think that all string assignments should be surrounded with such blocks everywhere? Then Mr. Lohmann here has a lot of fixing to do in his code examples, I can tell you that. And the lib would become inconvenient to use. I completely get the ideas the participants here have been pushing, but as a life-long software engineer, I have to remind you that it's okay to "break the rules" sometimes. In this case, the proposed enhancement would make this lib easier to use for those who know what they are doing. I don't know what your background is, but solving everything with exceptions seems like something a dogmatic C++ idealist would do who rarely touches actual code (and when they do, they only create a mess).

Well, I do code a lot, and string assignments are a big part of the use cases this library provides. Having a healthy dose of paranoia as a software architect, I write resilient code. Code that will not fatally break if some time in the future a rookie developer commits a change to it. I couldn't imagine the code from my application (below) being littered with try/catch blocks:

if (!users.empty()) {
    nlohmann::json balance_obj = nlohmann::json::object();

    balance_obj["id"] = get_id();

    try {
        balance_obj["account"]  = gets("account");
    } 
    catch (nlohmann::json::type_error& e) {
        // handle the exception
    }

    try {
        balance_obj["currency"] = get_name();
    } 
    catch (nlohmann::json::type_error& e) {
        // handle the exception
    }

    balance_obj["amount"]   = get_available();

    try {
        balance_obj["nice_size"]= get_nice_size();
    } 
    catch (nlohmann::json::type_error& e) {
        // handle the exception
    }

    try {
        balance_obj["nice_heed"]= get_nice_heed();
    } 
    catch (nlohmann::json::type_error& e) {
        // handle the exception
    }

    try {
        balance_obj["nice_name"]= get_nice_name();
    } 
    catch (nlohmann::json::type_error& e) {
        // handle the exception
    }

    for (auto uid : users) {
        to = GM()->instance_must_find(uid)->to_user();
        if (!to || to->is_imaginary()) continue;
        to->event("report_balance", &balance_obj);
    }
}

This, by contrast, is neat even to look at:

// While it could just look like this:

if (!users.empty()) {
    nlohmann::json balance_obj = nlohmann::json::object();
    balance_obj["id"]       = get_id();
    balance_obj["account"]  = (std::nothrow) gets("account");
    balance_obj["currency"] = (std::nothrow) get_name();
    balance_obj["amount"]   = get_available();
    balance_obj["nice_size"]= (std::nothrow) get_nice_size();
    balance_obj["nice_heed"]= (std::nothrow) get_nice_heed();
    balance_obj["nice_name"]= (std::nothrow) get_nice_name();
    for (auto uid : users) {
        to = GM()->instance_must_find(uid)->to_user();
        if (!to || to->is_imaginary()) continue;
        to->event("report_balance", &balance_obj);
    }
}

@sfinktah I would rather not add a dependency on nowide. Any idea whether we can do without it? In #1211 you found out that the library is reacting in the way you expected (sorry, I lost track a bit). Do we really need nowide?

Did you just mark my comment as off-topic?

Wow what a toxic community and an authoritarian piece of egomaniac you are @nlohmann . You lost my star here. Resorting to censorship?! That's just low. But it will all get back to you ;) have a nice life

Yes, I got it the first time, you don't like what I have got to say. I wonder why. Are you a narcissist? Because I've read those fellows find it hard to cope with criticism. And it aligns with the name of the lib, a pretty narcissistic one, to be frank.

The library's name is "JSON for Modern C++".

Alright, you won. I will just go back under the bridge now. It's sad things ended up like that though because I really like your lib. And to be honest, you are also quite alright. So if it makes any difference, I am really sorry for any inconveniences I might have caused.

Apologies accepted.

@nlohmann I absolutely agree that you should not add a dependency on nowide. It was just a tool I used to create a proof of concept.

Additionally, my POC was flawed, and does not actually work with the original raw input.

"\x06\xe4\xb4\xd8\x8d\x22"

PHP and Python additionally refuse to encode it:

    sfink@280x tmp $ echo -e '"\xe4\xb4\xd8\x8d"' | python -mjson.tool
    'utf8' codec can't decode bytes in position 0-1: invalid continuation byte
    sfink@280x tmp $ echo -e '"\xe4\xb4\xd8\x8d"' | phpparse - json_encode
    false

JavaScript, however (and it must be remembered that all strings in JavaScript are stored internally as UTF-16), will accept it, but silently adjusts the UTF-8 representation of the string to:

    ä´Ø"   ==  (HEX) 06 c3a4 c2b4 c398 c28d 22
    (JSON) "\u0006ä´Ø\""

Python will accept that, and tidies it into:

    "\u0006\u00e4\u00b4\u00d8\u008d"

Those numbers being the same as the original raw input... which fooled me into using that string instead of the original "\x06\xe4\xb4\xd8\x8d\x22" in some of my testing.

However, the \u version decodes as (UTF-8):

    225c 7530 3030 36c3 a4c2 b4c3 98c2 8d22  "\u0006........"

And as such it doesn't match the original input, and it is unknown whether JavaScript's internal UTF-16 encoding is "correct", let alone _desirable._

I think it might be a lost cause, and default behavior should be to throw an exception on encoding.

Of course, it would be nice to be able to override the default decoding for strings...

void to_json(json& j, const std::string& t);
or std::basic_string<T>
or std::basic_string<char_t>

but I'm not holding my breath, as I'm not sure that's really possible in C++.

Score 1 for XML. Perhaps this answers my question: "Why doesn't boost/serialize support JSON"

_for the record_ it is possible to convert a "UTF-ified" binary string, back into the original binary. But it would probably be incorrect for a library to do so.

    *** STRING ENCODING TEST
    parsing       : {"test":"\u0006\u00e4\u00b4\u00d8\u008d"}
    dumping       : {"test":"\u0006ä´Ø"}
    string        : 06 c3 a4 c2 b4 c3 98 c2 8d
    widened       : 06 00 e4 00 b4 00 d8 00 8d 00
    narrowed      : 06 e4 b4 d8 8d

    auto _wide = nowide::widen(value);
    std::string _narrow;
    for (const auto& c : _wide)
        if (c < 0x100)
            _narrow += (char)c;

or if you want to recover a mixture of valid UTF-8 and binary:

    for (size_t n = 0; n < _wide.length(); ++n) {
        if (_wide[n] < 0x100)
            _narrow += (char)_wide[n];
        else
            _narrow += nowide::narrow(_wide.substr(n, 1));
    }

Someone may find that useful one day. It will reverse the encoding done by my POC, or that done by JavaScript.

Interestingly, the whole thing becomes really simple when you operate entirely with std::wstring (inside and outside the JSON library.) Not surprising considering the JavaScript origins of the protocol.

I try to summarize some options we have in this situation. Not all of them were mentioned, but I try to get a picture of all possible ways to proceed:

  • Assignment of invalid Unicode:

    • Option 1: The library should "fix" invalid Unicode input (e.g. by removing invalid parts) and not throw.
    • Option 2: The library should throw an exception.
    • Option 3: The library should just assign the string (this is the status quo).
  • Serialization of previously stored invalid Unicode:

    • Option 1: This should be made impossible, e.g. by options 1 or 2 above.
    • Option 2: The library should dump the invalid Unicode as is.
    • Option 3: The library should "fix" the invalid Unicode (e.g. by not dumping invalid parts) and proceed.
    • Option 4: The library should throw an exception (this is the status quo).

Did I miss anything?

I too just stumbled over nlohmann::json throwing an exception on serializing an invalid UTF-8 string. I think throwing an exception is a pretty bad reaction in a lot of cases.

So here is my proposal: how about if nlohmann::json followed Python in allowing handlers to be registered for Unicode errors, with some common choices provided? One of the most useful choices is to automatically replace invalid sequences with the UNICODE REPLACEMENT CHARACTER '�', which makes sure that any automatic fixes are visible and yet you always get a valid sequence. I work with very large text datasets (think search engine) a lot, and often >99.999% of the data is valid UTF-8 but every couple of gigabytes there is an invalid sequence. In these cases throwing an exception really sucks and ignoring is invisible, but replacing means everything works as expected for valid input and leaves a visible marker that there was some bad data, while still working for the text around it.

@niklas88

Although I consider myself retired from this topic, your comment is hard to ignore. You have absolutely nailed it here. This indeed solves the problem the best possible way, because if I really wanted to trim the invalid sequences I could just iterate over the resulting string and delete all occurrences of the UNICODE REPLACEMENT CHARACTER. And of course, throwing an exception should itself be an exception. Many C++ programmers get it wrong there. Thanks for this insight. I personally would be pleased with this approach.

Couldn't you just create a function "sanitize_unicode_with_replacement_char(json&)" that recursively crawls the json structure to fix up any invalid strings the way you like? You could do this immediately after importing the structure (ie, reading the file) or right before exporting it (ie, dumping to file) in order to implement it at the right layer for your purpose.

The issue I have with doing anything by default is that _someone_ is going to want something different. So, we'd have to create a policy-based mechanism, which adds a great amount of complexity to the implementation and to the API.

The benefit of the current status quo is that it's "lossless" and permissive in the sense that it allows you to implement _any_ of the above behaviors by doing some extra work on top of what the library does.

everyone's friend is no one's friend

the fatal flaw of the status quo is that it is not obvious, and it will encourage people to write bad code that may one day cost someone their life, because a non-Unicode sequence slips into some life-critical medical system and triggers an unexpected exception at the wrong time.

it simply does not feel right to have the possibility of getting an exception from something as basic as a string assignment. IT SHOULD JUST WORK. in the rare cases where you would actually need either a lossless conversion or no conversion at all, you would do the extra work, but not the other way around.

solving all your problems by throwing an exception is a bad practice in software engineering.

Here's another proposal. Make the direct string assignment lossy while still throwing if invalid utf8 was given as an argument to the constructor. Below is example code.

nlohmann::json j_throw    = nlohmann::json("\xF0\xA4\xAD\xC0"); // Does throw.
nlohmann::json j_nothrow1 = "\xF0\xA4\xAD\xC0"; // Does not throw, lossy conversion.
nlohmann::json j_nothrow2 = std::string("\xF0\xA4\xAD\xC0"); // Does not throw, lossy conversion.

Couldn't you just create a function "sanitize_unicode_with_replacement_char(json&)" that recursively crawls the json structure to fix up any invalid strings the way you like? You could do this immediately after importing the structure (ie, reading the file) or right before exporting it (ie, dumping to file) in order to implement it at the right layer for your purpose.

The issue I have with doing anything by default is that _someone_ is going to want something different. So, we'd have to create a policy-based mechanism, which adds a great amount of complexity to the implementation and to the API.

The benefit of the current status quo is that it's "lossless" and permissive in the sense that it allows you to implement _any_ of the above behaviors by doing some extra work on top of what the library does.

I think the problem with the current behavior is that you need to run into it to know that it causes an exception. In our case (referenced above) our SPARQL engine worked completely fine with a >100 GB knowledge base (Wikidata), only to fail on a smaller one (Freebase), because we were unlucky and cut a UTF-8 sequence at the wrong point, or there was simply a wrong sequence in the data.

So our system worked fine for more than 100 GB of (nice) text input before we hit that exception.

Also, your workaround would mean going over the data and checking for wrong sequences twice. The current code already detects wrong sequences and reacts to them in a very non-benign way. So all I'm arguing is that it should give the user a way to control how that case is handled, and I think allowing the user to provide a function to handle it (maybe even with a default of the current behavior) is easy, elegant and versatile. Especially since checking for UTF-8 encoding correctness, while not very hard, isn't trivial either, and it's very natural to have different preferences for how such an error should be handled (replacing, throwing an exception, ignoring, …).

@nlohmann what do you think?

@niklas88 indeed has a point. The same thing happened to me, which led me to the creation of this issue. My program was running fine for more than a year until it started randomly crashing. Throwing from an assignment is pretty unorthodox to begin with. You can't expect people to handle every assignment operator; it is beyond the scope of common sense. I was right from the beginning, I knew it. The handler idea is excellent. I should have thought of it myself. Handlers are an ideal concept for resolving matters like this. But I would still emphasize that the default scenario should be the most resilient to all kinds of failures, which means that it should not throw.

@niklas88 I understand your proposal, and we try to borrow a lot of design decisions from Python's dump() function already. Right now, I would agree with https://github.com/nlohmann/json/issues/1198#issuecomment-428645699 that such a sanitation can be done outside of the library. I won't be able to work on this issue in the next three weeks though.

@jaredgrubb 's https://github.com/nlohmann/json/issues/1198#issuecomment-428645699 would work well as a temporary workaround but is in no way recommended as a long-term solution. In fact, it's dangerous, because it requires you to remember to do it every time you need to call json dump to be safe. If a project is huge and has many contributors, this can easily have fatal consequences when a new person joins the team who doesn't know this lore. So, for anyone reading: we are doing this only as an ugly hack here, but it is not the way you should develop professional software, so be safe! Workarounds like this will come back to haunt you in a couple of years.

Looking at https://docs.python.org/3/library/codecs.html#error-handlers, this seems to be a lot of work and noise in the API if we want to have a similarly extensible approach.

FWIW, I'm very sympathetic. Having the library throw during serialization five bytes from the end is about the worst time to have it happen, and having to _manually_ do these checks means that it's going to get forgotten/overlooked and lead to a vulnerability.

But, it's a hard problem without an obvious (let alone universal) answer. It's made worse by the fact that our STL hasn't even acknowledged that strings might (should? must!) have encoding.

It is already possible to get an invalid UTF-8 sequence in strings while decoding a JSON file ("\\u0080" is a valid JSON-string), so the most consistent option would probably be to replace invalid UTF-8 sequences with \u-escaped sequences instead of throwing an exception while _serializing_ strings. This will result in valid (UTF-8 encoded) JSON files and invalid UTF-8 (or other binary data) would round-trip. It's then up to the user to check the strings stored in a JSON value for correct encodings.

Seems that this issue is finally getting some legit discussion and good proposals. I personally prefer anything to throwing: be it a handler, be it the \u escape sequence or something else, just do not throw. That said, I'm reopening this as it obviously is still very much an issue.

It is already possible to get an invalid UTF-8 sequence in strings while decoding a JSON file ("\\u0080" is a valid JSON-string), so the most consistent option would probably be to replace invalid UTF-8 sequences with \u-escaped sequences instead of throwing an exception while _serializing_ strings. This will result in valid (UTF-8 encoded) JSON files and invalid UTF-8 (or other binary data) would round-trip. It's then up to the user to check the strings stored in a JSON value for correct encodings.

Ok, forget this... it's actually nonsense:

E.g., serializing \x80 would result in \\u0080, but when deserializing \\u0080, the resulting character U+0080 will be encoded in UTF-8 and result in \xC2\x80.

I opened a work-in-progress branch (https://github.com/nlohmann/json/compare/feature/codec_errors) for the idea sketched in https://github.com/nlohmann/json/issues/1198#issuecomment-428508453. Feedback is more than welcome! The branch currently only handles serialization.

I feel the term "ignore" is unclear about whether the invalid sequence is preserved or dropped. Perhaps "strip" or "remove"? OTOH, I think you're copying Python's terms and that is a valuable thing to do -- so maybe just update the comments to say that "ignore" means the invalid bytes are skipped.

@jaredgrubb not sure if you meant my opening comment, but I added a small edit to clarify what I meant by ignoring. Later I proposed a trim function which always produces a valid UTF8 string. Should probably have named it shave though : P ( can't have strings if you're not shaved, right? )

@jaredgrubb I took the names from Python. You're right the public API needs better documentation. But I would like to focus on the implementation before.

Oh, sorry for the confusion from my comment; it was a reply to @nlohmann's code sketch.

@nlohmann oh wow, thanks for putting in the work, this looks exactly like what I imagined. It turns out that in our case we will probably just not store the offending data in JSON in the future, because we do actually need the broken sequences. Still, I think this would be a great addition to your library and make it much more robust.

I reworked the code and added some test cases. The error handler is now a new optional parameter to the dump function. I am not sure whether I exactly implemented Python's behavior, so more test cases may be required.

One good thing about handling this situation in the dump function (rather than during assignment) is that it respects the principle of lazy evaluation. We could argue that not all json instances end up getting dumped and thus checking for invalid UTF8 in those does not make any difference (wastes CPU).

I now think I finished the implementation of the error handlers for the dump function, see #1314. It would be great if someone could have a look.

I just think it should support some other natural-language encodings, like GB18030 (GB2312/GBK), Big5, etc. If the input is just English letters it looks like UTF-8, but otherwise parsing will throw too. This has caused a lot of trouble for developers working in other natural languages.

Also, jsoncpp can parse such input successfully, so...

@nlohmann I just want it to support parsing other natural-language encodings.

@fawdlstty This library only supports UTF-8, sorry.
