Meson: configure_file fails for files not encoded in UTF-8 or ASCII

Created on 29 Mar 2017 · 18 Comments · Source: mesonbuild/meson

I'm working on wrapping a project (freeglut), and one of the files in the project that needs to be configured (using configure_file) is encoded in ISO-8859 and contains a character that isn't compatible with UTF-8. Meson dies trying to configure said file.

Here's the output:

Could not read input file /home/user/source/mesa-demos/subprojects/freeglut-3.0.0/freeglut.rc.in: 'utf-8' codec can't decode byte 0xa9 in position 740: invalid start byte

I don't know if this is a case really worth solving (how many projects use encodings other than ascii or utf-8?). Since it's just changing the encoding I've just copied the file into my wrap patch.
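For readers who want to see the failure in isolation, here is a minimal sketch in plain Python. The byte string is made up for illustration and is not taken from freeglut.rc.in; it only assumes that the offending byte 0xa9 was meant as an ISO-8859-1 copyright sign.

```python
# The byte 0xa9 is the copyright sign in ISO-8859-1, but on its own it is
# not a valid UTF-8 sequence (it can only appear as a continuation byte).
raw = b"Copyright \xa9 2017"  # illustrative data, not the real file

try:
    raw.decode("utf-8")
    utf8_error = None
except UnicodeDecodeError as exc:
    utf8_error = str(exc)

# Same kind of message as in the Meson output above.
print(utf8_error)

# Decoding with the encoding the file was actually written in works.
latin1_text = raw.decode("iso-8859-1")
print(latin1_text)
```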

dcbaker

All 18 comments

What character is it? We could probably try harder to decode files from such ISO encodings.

nirbheek on 30 Mar 2017

Why should it be decoded, at all?
Any file is just sequence of bytes.

msink on 30 Mar 2017

Meson files are always in UTF-8, so all the configure definitions are in UTF-8. Mixing two encodings is just asking for trouble.

jpakkane on 30 Mar 2017

configure_file() input needs to be decoded because we do substitutions on it before writing it out, and we must be able to guarantee that you can use arbitrary UTF-8 inside those substitutions.

However, there is a lossless mapping from the extended Latin charsets (ISO-8859 and others), so we should be able to decode it to UTF-8.
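A quick way to check the "lossless" claim for ISO-8859-1 specifically, as a sketch using only Python's built-in codecs (other code pages may leave some byte values undefined, so this does not generalize automatically):

```python
# Python's "iso-8859-1" codec maps each byte 0x00-0xff to the code point
# with the same number, so every possible byte string decodes.
all_bytes = bytes(range(256))
text = all_bytes.decode("iso-8859-1")

# Round trip back to the original bytes: nothing is lost.
roundtrip_ok = text.encode("iso-8859-1") == all_bytes

# The decoded text can also be stored as UTF-8 and read back unchanged.
utf8_bytes = text.encode("utf-8")
utf8_ok = utf8_bytes.decode("utf-8") == text

print(roundtrip_ok, utf8_ok)
```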

nirbheek on 30 Mar 2017

But how do you detect which ISO-8859-X it is? Just try all until one works? And then you write out the file in the same encoding or UTF-8?
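To illustrate why "try all until one works" is not enough, here is a small sketch: the same byte decodes without error under several ISO-8859 variants, but to different characters. The list of candidate encodings is an arbitrary selection for the example.

```python
raw = b"\xa9"  # the byte from the error message above

# Arbitrary selection of ISO-8859 parts, chosen for illustration only.
candidates = ["iso-8859-1", "iso-8859-2", "iso-8859-5", "iso-8859-15"]

decoded = {}
for name in candidates:
    try:
        decoded[name] = raw.decode(name)
    except UnicodeDecodeError:
        decoded[name] = None  # this part leaves the byte undefined

for name, char in decoded.items():
    print(name, repr(char))

# More than one distinct result means a successful decode alone
# cannot tell us which encoding the author intended.
distinct_results = set(decoded.values())
```

So a successful decode only proves the bytes are *valid* in that encoding, not that it is the right one.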

tp-m on 30 Mar 2017

Then I see only one solution - add new kwarg encoding.

msink on 30 Mar 2017

All encoding detection (without a header or magic number declaring it, which is required for some files such as Python) is heuristic, and it looks like str.decode() is bad at it? I expect (hope) there's a Python module for doing that detection. If there isn't, we should require people to be sane and use UTF-8 for all files.

nirbheek on 30 Mar 2017

There are two use cases here:

1. transition of old projects to Meson
2. needing to output ISO-8859-1 or something

For 1 we can just say to copy the source file to a different name while you have two build systems and then delete the old one. Or, better yet, convert to UTF-8 in the existing system. It should be doable for most cases and in fact should probably be done regardless of the build system. Having non-utf-8 source these days is just asking for trouble. Even VS supports it.

Case 2 is something that could require adding an encoding keyword. However that requires proof of usage in the real world and in several different projects.

jpakkane on 30 Mar 2017

Python only does explicit decoding from binary data to text, or treats the data as a byte stream, and the caller is responsible for doing the right thing. Unfortunately Python doesn't have a stdlib module for doing autodetection; python-magic is generally used for this, at least on *nix and macOS.

Honestly, from my point of view it only makes sense to support non-UTF-8 for projects being wrapped; if you're porting your project to Meson, just use UTF-8. And that seems like kinda a niche case and difficult to implement in such a way that only wraps use it, so maybe just catching the decode error and printing a nicer error message would be sufficient, IMHO.
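As a rough sketch of what "catch the decode error and print a nicer message" could look like: the names `InputEncodingError` and `read_text_input` are hypothetical and are not part of Meson, and the message wording is only a suggestion.

```python
import os
import tempfile


class InputEncodingError(Exception):
    """Hypothetical error type used only in this sketch."""


def read_text_input(path, encoding="utf-8"):
    """Read a file as text, turning decode failures into a clearer error."""
    with open(path, "rb") as f:
        raw = f.read()
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError as exc:
        bad = raw[exc.start]  # the first byte that could not be decoded
        raise InputEncodingError(
            f"Could not read input file {path}: byte 0x{bad:02x} at position "
            f"{exc.start} is not valid {encoding}. Convert the file to "
            f"{encoding} or replace it in the wrap patch."
        ) from exc


# Demonstration with a throwaway file containing one ISO-8859-1 byte.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.rc.in")
    with open(path, "wb") as f:
        f.write(b"Copyright \xa9 2017\n")
    try:
        read_text_input(path)
        message = None
    except InputEncodingError as exc:
        message = str(exc)

print(message)
```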

dcbaker on 30 Mar 2017

I've hit the same problem in gtk+ with Windows resource files (.rc) which are either in the ANSI (mbcs) encoding or utf-16. I've worked around it by removing all non-ASCII for now.

Just to add another potential use case for this.

lazka on 18 Jul 2017

@lazka that should be fixed on Windows with Python 3.7, I think: https://www.python.org/dev/peps/pep-0538/

nirbheek on 18 Jul 2017

Actually, it should already be fixed in Python 3.6. Can you try with that?

nirbheek on 18 Jul 2017

I don't think this is related. Afaics this issue is about the file content encoding which should be locale/code page independent anyway.

lazka on 18 Jul 2017

No, it talks about both console encoding and filesystem encoding.

nirbheek on 18 Jul 2017

By "this issue" I meant #1542 :) console and filesystem encoding are not relevant here.

lazka on 18 Jul 2017

Same problem here with this file: https://github.com/marcosps/kernel_experiments/blob/master/userspace/namespaces/completions/bash/ns_exec

Using file, it returns us-ascii:
ns_exec: text/x-shellscript; charset=us-ascii

Would this be a problem for Meson?

marcosps on 13 Oct 2017

Fact of life is that you can't change the world, you have to change your program. I added an encoding keyword, a test, and updated docs in #3135. Hopefully this is acceptable for the Meson devs.
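The general idea behind an explicit encoding argument can be sketched in Python as follows. This is not the code from #3135 and not Meson's implementation: `configure_bytes`, the `@NAME@` pattern handling, and the template contents are all assumptions made up for the example.

```python
import re


def configure_bytes(raw, values, encoding="utf-8"):
    """Decode with the given encoding, replace @NAME@ tokens, re-encode.

    Hypothetical helper: unknown tokens are left untouched here, which
    may differ from what Meson actually does.
    """
    text = raw.decode(encoding)

    def replace(match):
        return values.get(match.group(1), match.group(0))

    result = re.sub(r"@([A-Za-z0-9_]+)@", replace, text)
    # Writing back in the same encoding keeps the rest of the file intact.
    return result.encode(encoding)


# Made-up template containing an ISO-8859-1 copyright sign (0xa9).
template = b'VALUE "LegalCopyright", "\xa9 @YEAR@ @OWNER@"\n'
out = configure_bytes(
    template, {"YEAR": "2017", "OWNER": "Example"}, encoding="iso-8859-1"
)
print(out)
```

With the default `encoding="utf-8"` the same call would fail on the 0xa9 byte, which is the situation the original report describes.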