Related: #722 #216
wstring is currently not supported.
That's a pity. Thank you.
Well, I think this could be fixed by writing an input adapter for std::wstring, std::u16string, and std::u32string. I'll see what I can do.
So my naive approach would be to implement an input adapter for std::wstring, std::u16string, and std::u32string and take care of the translation from UTF-16 and UTF-32 to UTF-8, respectively. This adapter would read a character from the string and internally offer several UTF-8 bytes to the consumer of the input adapter. This would allow reusing the lexer/parser chain for JSON files.
I shall open a feature branch, but would be happy for any comment whether I overlooked some fundamental flaw in this idea.
On a previous project I derived a string class from std::string that gave me my utf-8/16/32 support. My suggestion is to consider doing the same.
The benefit would be that it would also give generic utf string support everywhere else where you would use it. ie: json::string instead of std::string. There are open source versions floating around but it is easy to roll your own.
But that's my 2 cents worth.
@us-er-name I don't fully understand. You mean to use a Unicode-friendly user-defined string class which is basically std::string plus some conversion functions? But how does this play with templating the JSON string type?
I apologise if I'm way off base here. To answer your first question: yes.
Once, in my own code, I was having difficulty supporting different string types, so I finally got tired of it and wrote my own string class derived from std::string. I added utf-8/16/32 support and used it instead. I replaced all std::string in my code with myspace::string and that was it. Now I was able to pass wstring, u16string, u32string and cast back and forth without problems. The conversion was minor in my case.
You have a lot of code so I didn't go over it in any detail but I was wondering if you could do the same and then you would also have the benefit of having a generic string class that could be used for other purposes as well. Either way it is done, it mainly boils down to a set of routines converting to and from utf-8, utf-16, utf-32.
In the end, it winds up having the same result - your code internally will be utf friendly. My suggestion was to create a utf string class for you to use, achieving the same objective plus the added benefit of a utf-friendly string class that could be used by itself as well - an extra bonus.
i.e.:
using utf = nlohmann::utf; // namespace nlohmann { namespace utf { class string : public std::string { ...
now:
utf::string mystring; // Unicode friendly string
Something like that. From just a quick glance it seems like you have made provisions to support other types than just std::string. I see in a template definition:
class StringType = std::string
Would be nice if you had your own utf string class and then it could be changed (using my first example) to:
class StringType = utf::string
As I said, I may be way off base here. Sometimes things like this go easy and sometimes it can't be done without major rework. You tagged this thread as "please discuss" so this was just a suggestion (assuming it could be done). You know your own code much better than me.
Thanks for clarifying!
My current take would be to use UTF-8 internally and convert UTF-16 and UTF-32 strings into UTF-8 while parsing. This would then allow any user-defined string type to be used in the library as long as it is fine with UTF-8. Another place for recoding would then be the serialization function, which currently assumes UTF-8. Furthermore, assignment from/to strings would need similar code to parsing/serializing.
A Unicode-friendly string class would simplify a lot (the parser would need to be adjusted), but I have no idea how to mix this with the option to pass a user-defined string type to the library. And the latter is used by some folks on embedded devices which implement all kinds of optimizations into their strings.
We're pretty much talking the same thing. I was suggesting using UTF-8 internally, storing it in a std::string. You already use std::string internally. The new class would be just to encapsulate the UTF-16 and UTF-32 methods instead of being parts of other code. So it is really just an enhanced version of std::string. If anyone else has used a customised string type then this should not interfere.
I believe this satisfies parts of your definition: "My current take would be to use UTF-8 internally" and "This would then allow any user-defined string type to be used in the library as long as it is fine with UTF-8". The only difference is where conversions are done.
I tried to quickly write this utf class to show you (and you could use if you liked it), but I got stuck in the code section:
template<> struct hash<nlohmann::json>
Your default is to use std::string in the hash - for which there is a template specialisation, but my derived class does not have one. Thus my compile fails with a deleted constructor error. I've never delved deep into std:: stuff, especially std::hash. So I am unable to get my class to compile. I also have not kept up with the latest C++ template goodies, so it is quickly becoming a painful experience. Except for this hash stuff, everything else seems simple to switch over to my new class.
I noticed that your code does not handle UTF-8 at the moment but this class could help in that.
i.e., you use std::string::find_first_of() which is not UTF-8 aware. If you had your own string class then you would simply rewrite it.
I've decided to post what I have so far. Except for the hash stuff, the rest compiles. Obviously no chance to debug. At least it shows what I was aiming for.
#include <string> // string, stoi, to_string
namespace utf {
class string : public std::string {
public:
string() {}
string(size_type n, char c) { assign(n, c); }
string(const char * s, std::size_t n) { assign(s, n); }
string(const char * s) { assign(s); }
string(const string & s) { *this = s; }
string(const std::string & s) { assign(s); }
friend string operator+(const string& lhs, const string& rhs) { return (std::string)lhs + (std::string)rhs; }
friend string operator+(const string& lhs, const char * rhs) { return (std::string)lhs + rhs; }
friend string operator+(const char * lhs, const string& rhs) { return lhs + (std::string)rhs; }
friend string operator+(const string& lhs, char rhs) { return (std::string)lhs + rhs; }
friend string operator+(char lhs, const string& rhs) { return lhs + (std::string)rhs; }
// operator wstring() const; // TODO
// operator std::u16string() const; // TODO
// operator std::u32string() const; // TODO
// string(const wchar_t *); // TODO
// string(const char16_t *); // TODO
// string(const char32_t *); // TODO
// Any others you find useful
};
string to_string(int id_) { return std::to_string(id_); }
}
I used the single header file and changed all std::string to utf::string. If you do a global string replace, watch out for std::stringstream. Leave that as is.
There are two references to std::to_string which should be changed to utf::to_string.
The line:
auto reference_token = reference_string.substr(start, slash - start);
should have the keyword 'auto' changed to utf::string
As mentioned, the only compile errors should be with the hash stuff.
So close. sighs
> I noticed that your code does not handle UTF-8 at the moment but this class could help in that.
> i.e., you use std::string::find_first_of() which is not UTF-8 aware. If you had your own string class then you would simply rewrite it.
I am handling UTF-8 - the std::string::find_first_of() call is only used for / and ~ inside the JSON Pointer implementation, so I do not need to worry about the encoding there.
Yes. I thought of that exact case way later - that those routines were internal somehow and utf handling there was not necessary. Sorry about that.
FYI: I fixed the hash problem and have the demo code working.
I added this after the class definition:
namespace std {
template <> struct hash<utf::string>
{
size_t operator()(const utf::string & s) const
{
return hash<std::string>()(s);
}
};
}
Now (in my code that uses your json code) when I have a primitive string type, all I do is:
// Primitive type pointed to by 'val'
if (val->is_string()) {
utf::string str = (utf::string)*val;
}
I dropped this into something I wrote that uses your json code and it seems to work fine. Of course I had to do the changes I mentioned under that code. That is not a comprehensive test of course.
All that would be needed is to code the TODO tagged methods to convert to/from UTF-8 for the various formats.
In short, I proposed this solution since I believe your code should keep working at the lowest common denominator: std::string. You already support UTF-8 in it. Any UTF-16 or UTF-32 support I felt should be just conversions of the std::string that holds your UTF-8. Thus I wrote this utf class to demonstrate it.
Your proposal is also a perfectly valid alternative.
Naturally, do what you feel is best. I just wanted to show an alternate solution. For as many programmers put on a problem, there will be at least that many different solutions.
Keep up the good work!
I am not sure how this would work if someone uses her own string type as a template argument to basic_json. This is the reason I put all encoding stuff into the input adapters: to make sure the string type inside the basic_json class does not need to cope with this.
Just a little comment on class string : public std::string
You should not inherit from STL containers: they do not have a virtual destructor and are not meant to be used as base classes.
To read more about it do a quick google search for e.g. "inherit std::string".
@Bosswestfalen Thanks for the heads up! That is a fair comment to make. Also, there are pitfalls when doing this - as anyone googling your search will see. To be honest, I do not completely agree with all points I've read but I strongly agree with most of it. _Readers of this thread should take note of what you said and at least read up on this if they are not familiar with the subject but want to do something similar._
In my case I specifically made sure the footprint does not change. I need no ctor or dtor other than what is supplied by std::string. No other new stuff was added that changed how the underlying std::string works. Certainly I do not propose making any changes to this class that would alter this.
It's just a convenient wrapper class.
In my first google hit from your suggested search I found someone stating:
"There is absolutely no reason to publicly derive a class in C++ if you're not trying to do something polymorphic. The language comes with free functions as a standard feature of the language, and free functions are what you should be using here.
Think of it this way -- do you really want to force clients of your code to convert to using some proprietary string class simply because you want to tack on a few methods?"
My answer is "in some cases, yes". It's called "encapsulation". Otherwise IMO we might as well go back to C. I'm not just wrapping a few functions (although technically I am). I was attempting to create a UTF friendly class in a C++ way so that the OP's concerns were addressed. Even nlohmann stated "A Unicode-friendly string class would simplify a lot".
I could simply copy some open source std::string implementation and add a few methods to it to make it UTF friendly. Technically that would address Bosswestfalen's concerns too - but that seems like a silly way to go. And we'd still have the same thing as I proposed.
Do we petition the standards body for C++ to come up with utf conversion functions? Do we force users to employ workarounds simply because "we can but we're told we shouldn't"? At the top of the thread, the OP replied: "That's a pity. Thank you." I simply wanted to address the original OP's needs while not fundamentally changing anything this source was doing: UTF-8 in a std::string.
Yes, C++ has a bigger noose to allow more people to hang themselves. I always worry that I am winding up doing just that. That's why I decided to implement what I proposed - to make sure I wasn't doing just that.
I've heard different points of view on this and other things like "reinterpret_cast should never be allowed", "removing the const attribute is bad", etc. One has to weigh the experience of others along with your needs and decide what is "right" on a case by case basis. Heck, even static_cast has caveats - do we stop using it?
Anyways, in the end, my proposal was just that - a proposal that would possibly fit the needs for all. It, in my opinion, addresses these needs without any negative side effects.
If nlohmann's solution works, then I suggest going that way.
Why? First and foremost: He is the maintainer of this code. It is better to have him confident of his own code than to hesitantly adopt something he is unsure of - as stated in his last post.
This outweighs anything else.
Secondly: Everyone will be happy to use std::string as is. Thus there will be even less work for people to use this code, while integrating UTF-16 and UTF-32.
I really don't want this to degenerate into a debate. This is an "issue thread" and I replied only since nlohmann marked it as "state: please discuss". Any concerns that fall under this "state: please discuss" naturally belong here. Any debate with me, I ask you to email me directly.
Thanks for reading.
Any other comments on this?
Merged.