Related: #722 #216
wstring is currently not supported.
That's a pity. Thank you.
Well, I think this could be fixed by writing an input adapter for std::wstring, std::u16string, and std::u32string. I'll see what I can do.
So my naive approach would be to implement an input adapter for std::wstring, std::u16string, and std::u32string and take care of the translation from UTF-16 and UTF-32 to UTF-8, respectively. This adapter would read a character from the string and internally offer several UTF-8 bytes to the consumer of the input adapter. This would allow reusing the lexer/parser chain for JSON files.
I shall open a feature branch, but would be happy for any comment whether I overlooked some fundamental flaw in this idea.
On a previous project I derived a string class from std::string that gave me my utf-8/16/32 support. My suggestion is to consider doing the same.
The benefit would be that it would also give generic utf string support everywhere else where you would use it. ie: json::string instead of std::string. There are open source versions floating around but it is easy to roll your own.
But that's my 2 cents worth.
@us-er-name I don't fully understand. You mean to use a Unicode-friendly user-defined string class which is basically std::string plus some conversion functions? But how does this play with templating the JSON string type?
I apologise if I'm way off base here. To answer your first question: yes.
Once, in my own code, I was having difficulty supporting different string types, so I finally got tired of it and wrote my own string class derived from std::string. I added utf-8/16/32 support and used it instead. I replaced all std::string in my code with myspace::string and that was it. Now I was able to pass wstring, u16string, u32string and cast back and forth without problems. The conversion was minor in my case.
You have a lot of code so I didn't go over it in any detail but I was wondering if you could do the same and then you would also have the benefit of having a generic string class that could be used for other purposes as well. Either way it is done, it mainly boils down to a set of routines converting to and from utf-8, utf-16, utf-32.
In the end, it winds up having the same result - your code internally will be utf friendly. My suggestion was to create a utf string class for you to use, achieving the same objective plus the added benefit of a utf-friendly string class that could be used by itself as well - an extra bonus.
i.e.:
using utf = nlohmann::utf; // namespace nlohmann { namespace utf { class string : public std::string { ...
now:
utf::string mystring; // Unicode friendly string
Something like that. From just a quick glance it seems like you have made provisions to support other types than just std::string. I see in a template definition:
class StringType = std::string
Would be nice if you had your own utf string class and then it could be changed (using my first example) to:
class StringType = utf::string
As I said, I may be way off base here. Sometimes things like this go easy and sometimes it can't be done without major rework. You tagged this thread as "please discuss" so this was just a suggestion (assuming it could be done). You know your own code much better than me.
Thanks for clarifying!
My current take would be to use UTF-8 internally and convert UTF-16 and UTF-32 strings into UTF-8 while parsing. This would then allow any user-defined string type to be used in the library as long as it is fine with UTF-8. Another place for recoding would then be the serialization function, which currently assumes UTF-8. Furthermore, assignment from/to strings would need similar code to parsing/serializing.
A Unicode-friendly string class would simplify a lot (the parser would need to be adjusted), but I have no idea how to mix this with the option to pass a user-defined string type to the library. And the latter is used by some folks on embedded devices which implement all kinds of optimizations into their strings.
We're pretty much talking the same thing. I was suggesting using UTF-8 internally, storing it in a std::string. You already use std::string internally. The new class would be just to encapsulate the UTF-16 and UTF-32 methods instead of being parts of other code. So it is really just an enhanced version of std::string. If anyone else has used a customised string type then this should not interfere.
I believe this satisfies parts of your definition: "My current take would be to use UTF-8 internally" and "This would then allow any user-defined string type to be used in the library as long as it is fine with UTF-8". The only difference is where conversions are done.
I tried to quickly write this utf class to show you (and you could use if you liked it), but I got stuck in the code section:
template<> struct hash<nlohmann::json>
Your default is to use std::string in the hash - for which there is a template specialisation, but my derived class does not have one. Thus my compile fails with a deleted constructor error. I've never delved deep into std:: stuff, especially std::hash. So I am unable to get my class to compile. I also have not kept up with the latest C++ template goodies, so it is quickly becoming a painful experience. Except for this hash stuff, everything else seems simple to switch over to my new class.
I noticed that your code does not handle UTF-8 at the moment but this class could help in that.
i.e., you use std::string::find_first_of() which is not UTF-8 aware. If you had your own string class then you would simply rewrite it.
I've decided to post what I have so far. Except for the hash stuff, the rest compiles. Obviously no chance to debug. At least it shows what I was aiming for.
#include <string> // string, stoi, to_string
namespace utf {
class string : public std::string {
public:
string() {}
string(size_type n, char c) { assign(n, c); }
string(const char * s, std::size_t n) { assign(s, n); }
string(const char * s) { assign(s); }
string(const string & s) { *this = s; }
string(const std::string & s) { assign(s); }
friend string operator+(const string& lhs, const string& rhs) { return (std::string)lhs + (std::string)rhs; }
friend string operator+(const string& lhs, const char * rhs) { return (std::string)lhs + rhs; }
friend string operator+(const char * lhs, const string& rhs) { return lhs + (std::string)rhs; }
friend string operator+(const string& lhs, char rhs) { return (std::string)lhs + rhs; }
friend string operator+(char lhs, const string& rhs) { return lhs + (std::string)rhs; }
// operator wstring() const; // TODO
// operator std::u16string() const; // TODO
// operator std::u32string() const; // TODO
// string(const wchar_t *); // TODO
// string(const char16_t *); // TODO
// string(const char32_t *); // TODO
// Any others you find useful
};
string to_string(int id_) { return std::to_string(id_); }
}
I used the single header file and changed all std::string to utf::string. If you do a global string replace, watch out for std::stringstream. Leave that as is.
There are two references to std::to_string which should be changed to utf::to_string.
The line:
auto reference_token = reference_string.substr(start, slash - start);
should have the keyword 'auto' changed to utf::string
As mentioned, the only compile errors should be with the hash stuff.
So close. sighs
> I noticed that your code does not handle UTF-8 at the moment but this class could help in that.
> i.e., you use std::string::find_first_of() which is not UTF-8 aware. If you had your own string class then you would simply rewrite it.
I am handling UTF-8 - the std::string::find_first_of() call is only used for / and ~ inside the JSON Pointer implementation, so I do not need to worry about the encoding there.
Yes. I thought of that exact case way later - that those routines were internal somehow and utf handling there was not necessary. Sorry about that.
FYI: I fixed the hash problem and have the demo code working.
I added this after the class definition:
namespace std {
template <> struct hash<utf::string>
{
size_t operator()(const utf::string & s) const
{
return hash<std::string>()(s);
}
};
}
Now (in my code that uses your json code) when I have a primitive string type, all I do is:
// Primitive type pointed to by 'val'
if (val->is_string()) {
utf::string str = (utf::string)*val;
}
I dropped this into something I wrote that uses your json code and it seems to work fine. Of course I had to do the changes I mentioned under that code. That is not a comprehensive test of course.
All that would be needed is to code the TODO tagged methods to convert to/from UTF-8 for the various formats.
In short, I proposed this solution since I believe your code should keep working at the lowest common denominator: std::string. You already support UTF-8 in it. Any UTF-16 or UTF-32 support I felt should be just conversions of the std::string that holds your UTF-8. Thus I wrote this utf class to demonstrate it.
Your proposal is also a perfectly valid alternative.
Naturally, do what you feel is best. I just wanted to show an alternate solution. For as many programmers put on a problem, there will be at least that many different solutions.
Keep up the good work!
I am not sure how this would work if someone uses her own string type as a template argument to basic_json. This is the reason I put all encoding stuff into the input adapters: to make sure the string type inside the basic_json class does not need to cope with this.
Just a little comment on class string : public std::string
You should not inherit from STL containers: they do not have a virtual destructor and are not meant to be used as base classes.
To read more about it do a quick google search for e.g. "inherit std::string".
@Bosswestfalen Thanks for the heads up! That is a fair comment to make. Also, there are pitfalls when doing this - as anyone googling your search will see. To be honest, I do not completely agree with all points I've read but I strongly agree with most of it. _Readers of this thread should take note of what you said and at least read up on this if they are not familiar with the subject but want to do something similar._
In my case I specifically made sure the footprint does not change. I need no ctor or dtor other than what is supplied by std::string. No other new stuff was added that changed how the underlying std::string works. Certainly I do not propose making any changes to this class that would alter this.
It's just a convenient wrapper class.
In my first google hit from your suggested search I found someone stating:
"There is absolutely no reason to publicly derive a class in C++ if you're not trying to do something polymorphic. The language comes with free functions as a standard feature of the language, and free functions are what you should be using here.
Think of it this way -- do you really want to force clients of your code to convert to using some proprietary string class simply because you want to tack on a few methods?"
My answer is "in some cases, yes". It's called "encapsulation". Otherwise IMO we might as well go back to C. I'm not just wrapping a few functions (although technically I am). I was attempting to create a UTF friendly class in a C++ way so that the OP's concerns were addressed. Even nlohmann stated "A Unicode-friendly string class would simplify a lot".
I could simply copy some open source std::string implementation and add a few methods to it to make it UTF friendly. Technically that would address Bosswestfalen's concerns too - but that seems like a silly way to go. And we'd still have the same thing as I proposed.
Do we petition the standards body for C++ to come up with utf conversion functions? Do we force users to employ workarounds simply because "we can but we're told we shouldn't"? At the top of the thread, the OP replied: "That's a pity. Thank you." I simply wanted to address the original OP's needs while not fundamentally changing anything this source was doing: UTF-8 in a std::string.
Yes, C++ has a bigger noose to allow more people to hang themselves. I always worry that I am winding up doing just that. That's why I decided to implement what I proposed - to make sure I wasn't doing just that.
I've heard different points of view on this and other things like "reinterpret_cast should never be allowed", "removing the const attribute is bad", etc. One has to weigh the experience of others along with your needs and decide what is "right" on a case by case basis. Heck, even static_cast has caveats - do we stop using it?
Anyways, in the end, my proposal was just that - a proposal that would possibly fit the needs for all. It, in my opinion, addresses these needs without any negative side effects.
If nlohmann's solution works, then I suggest going that way.
Why? First and foremost: He is the maintainer of this code. It is better to have him confident of his own code than to hesitantly adopt something he is unsure of - as stated in his last post.
This outweighs anything else.
Secondly: Everyone will be happy to use std::string as is. Thus there will be even less work for people to use this code, while integrating UTF-16 and UTF-32.
I really don't want this to degenerate into a debate. This is an "issue thread" and I replied only since nlohmann marked it as "state: please discuss". Any concerns that fall under this "state: please discuss" naturally belong here. Any debate with me, I ask you to email me directly.
Thanks for reading.
Any other comments on this?
Merged.