I'm working on wrapping a project (freeglut), and one of the files in the project that needs to be configured (using configure_file) is encoded in ISO-8859 and contains a character that isn't compatible with UTF-8. Meson dies trying to configure that file.
Here's the output:
Could not read input file /home/user/source/mesa-demos/subprojects/freeglut-3.0.0/freeglut.rc.in: 'utf-8' codec can't decode byte 0xa9 in position 740: invalid start byte
I don't know if this is a case really worth solving (how many projects use encodings other than ascii or utf-8?). Since it's just changing the encoding I've just copied the file into my wrap patch.
What character is it? We could probably try harder to decode files from such ISO encodings.
Why should it be decoded, at all?
Any file is just a sequence of bytes.
Meson files are always in UTF-8, so all the configure definitions are in UTF-8. Mixing two encodings is just asking for trouble.
configure_file() input needs to be decoded because we do substitutions on it before writing it out, and we must be able to guarantee that you can use arbitrary UTF-8 inside those substitutions.
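To illustrate the point about substitutions, here is a minimal sketch of a decode, substitute, encode pipeline. The `substitute` helper and the `@NAME@` token pattern are assumptions made for illustration, not Meson's actual implementation; the sketch only shows why a stray 0xa9 byte stops everything at the decode step.

```python
import re

def substitute(raw: bytes, values: dict[str, str]) -> bytes:
    """Hypothetical stand-in for the substitution step, not Meson's code."""
    # Decoding is strict: a byte such as 0xa9 that is not valid UTF-8 here
    # raises UnicodeDecodeError before any substitution happens.
    text = raw.decode("utf-8")
    # Replace each @NAME@ token with its value; unknown tokens are left alone.
    text = re.sub(r"@(\w+)@", lambda m: values.get(m.group(1), m.group(0)), text)
    return text.encode("utf-8")

# A UTF-8 template accepts a non-ASCII substitution value without trouble.
out = substitute(b"Copyright @HOLDER@\n", {"HOLDER": "Pawe\u0142 \u00a9"})

# An ISO-8859-1 template containing the byte 0xa9 fails at the decode step.
try:
    substitute(b"Copyright \xa9 @HOLDER@\n", {"HOLDER": "someone"})
    failed = False
except UnicodeDecodeError:
    failed = True
```

Working on raw bytes instead would avoid the decode, but then a substitution value containing arbitrary UTF-8 could end up embedded in a file of a different encoding.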
However, there is a lossless mapping from the extended Latin charsets (ISO-8859 and others), so we should be able to decode it to UTF-8.
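As a rough check of the lossless-mapping claim, the sketch below round-trips every possible byte value through ISO-8859-1 and UTF-8. It assumes the file really is ISO-8859-1; the helper names are made up for this example.

```python
def latin1_to_utf8(raw: bytes) -> bytes:
    # ISO-8859-1 maps each byte 0x00..0xff to the code point of the same value.
    return raw.decode("iso-8859-1").encode("utf-8")

def utf8_to_latin1(raw: bytes) -> bytes:
    # Only succeeds if every character is within U+0000..U+00FF.
    return raw.decode("utf-8").encode("iso-8859-1")

every_byte = bytes(range(256))
round_tripped = utf8_to_latin1(latin1_to_utf8(every_byte))

# The byte from the error message, 0xa9, becomes a two-byte UTF-8 sequence.
copyright_utf8 = latin1_to_utf8(b"\xa9")
```

The reverse direction is only safe while the text stays within Latin-1; a substitution that introduces a character outside that range could not be written back in the original encoding.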
But how do you detect which ISO-8859-X it is? Just try all until one works? And then you write out the file in the same encoding or UTF-8?
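A small sketch of why "try all until one works" is not enough: the same byte decodes without error under several ISO-8859 variants, but to different characters. The candidate list and the `decode_with_each` helper are assumptions chosen for illustration.

```python
CANDIDATES = ["iso-8859-1", "iso-8859-2", "iso-8859-5", "iso-8859-15"]

def decode_with_each(raw: bytes, encodings=CANDIDATES) -> dict:
    """Decode raw with every candidate; None marks a failed decode."""
    results = {}
    for name in encodings:
        try:
            results[name] = raw.decode(name)
        except UnicodeDecodeError:
            # Some ISO-8859 variants leave certain byte values unassigned.
            results[name] = None
    return results

# The byte Meson complained about in freeglut.rc.in.
results = decode_with_each(b"\xa9")
for name, char in results.items():
    print(name, repr(char))
```

Since more than one candidate succeeds, a successful decode by itself says nothing about which encoding the author intended.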
Then I see only one solution: add a new kwarg, encoding.
All encoding detection (without a header or magic number declaring it, which is required for some files such as Python) is heuristic, and it looks like str.decode() is bad at it? I expect (hope) there's a Python module for doing that detection. If there isn't, we should require people to be sane and use UTF-8 for all files.
There are two use cases here
For 1 we can just say to copy the source file to a different name while you have two build systems and then delete the old one. Or, better yet, convert to UTF-8 in the existing system. It should be doable for most cases and in fact should probably be done regardless of the build system. Having non-utf-8 source these days is just asking for trouble. Even VS supports it.
Case 2 is something that could require adding an encoding keyword. However that requires proof of usage in the real world and in several different projects.
Python only does explicit decoding from binary data to
Honestly, from my point of view it only makes sense to support non-UTF-8 for projects being wrapped; if you're porting your own project to Meson, just use UTF-8. That seems like a niche case, and it would be difficult to implement in such a way that only wraps can use it, so maybe just catching the decode error and printing a nicer error message would be sufficient, IMHO.
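For what the "nicer error message" option might look like, here is a hedged sketch. The `read_utf8_template` function, the choice of `RuntimeError`, and the message wording are all hypothetical; Meson's real error handling may differ.

```python
import os
import tempfile

def read_utf8_template(path: str) -> str:
    """Read a template as UTF-8, turning decode failures into a clear hint."""
    with open(path, "rb") as f:
        raw = f.read()
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError as exc:
        # exc.start is the offset of the offending byte within raw.
        bad = raw[exc.start]
        raise RuntimeError(
            f"{path} is not valid UTF-8 (byte 0x{bad:02x} at position "
            f"{exc.start}); please convert the file to UTF-8."
        ) from exc

# Demonstrate with a throwaway file holding one ISO-8859-1 byte.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "freeglut.rc.in")
    with open(path, "wb") as f:
        f.write(b"abc\xa9")
    try:
        read_utf8_template(path)
        message = ""
    except RuntimeError as exc:
        message = str(exc)
```

Reading the whole file as bytes before decoding keeps the reported position an absolute offset into the file.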
I've hit the same problem in gtk+ with Windows resource files (.rc) which are either in the ANSI (mbcs) encoding or utf-16. I've worked around it by removing all non-ASCII for now.
Just to add another potential use case for this.
@lazka that should be fixed on Windows with Python 3.7, I think: https://www.python.org/dev/peps/pep-0538/
Actually, it should already be fixed in Python 3.6. Can you try with that?
I don't think this is related. Afaics this issue is about the file content encoding which should be locale/code page independent anyway.
No, it talks about both console encoding and filesystem encoding.
By "this issue" I meant #1542 :) console and filesystem encoding are not relevant here.
Same problem here with this file: https://github.com/marcosps/kernel_experiments/blob/master/userspace/namespaces/completions/bash/ns_exec
Using file, it returns us-ascii:
ns_exec: text/x-shellscript; charset=us-ascii
Would this be a problem for Meson?
The fact of life is that you can't change the world; you have to change your program. I added an encoding keyword, a test, and updated docs in #3135. Hopefully this is acceptable to the Meson devs.
I think #3383 closed this by accident