Node: Non-UTF-8 env vars

Created on 12 Nov 2017 · 17Comments · Source: nodejs/node

Version: v8.8.1
Platform: Linux daurn-z170 4.13.7-1-ARCH #1 SMP PREEMPT Sat Oct 14 20:13:26 CEST 2017 x86_64 GNU/Linux
Subsystem: process

Node.js doesn't seem to have a way to read environment variables that aren't valid unicode. For both keys and values. e.g.

$ env -i $'F\xa5B=BAR' node -e 'console.log(process.env)' 
{ 'F�B': undefined }
$ env -i $'FOO=B\xa5R' node -e 'console.log(process.env.FOO.codePointAt(1))'
65533

feature request process

Source

daurnimator

Most helpful comment

process.env.get is not an option, IMO - that would break existing code that tries to read a property called 'get'.

It also makes the environment accessors slow and awkward since they have to check for these magic properties every time.

What might be acceptable is os.getenv() and os.setenv() methods but then again, it seems too minor to merit new APIs.

The binary data example is flawed and will never work; envvars are C strings, nul bytes terminate them.

bnoordhuis on 16 Nov 2017

👍3

All 17 comments

Issue seems to stem from assumption that env vars are utf8. e.g. In the getter and during enumeration

daurnimator on 12 Nov 2017

UTF-8 is assumed in a lot of other places, environment variables aren't the exception.

The behavior of process.env can't change because of backwards compatibility (i.e., we can't change the encoding to latin-1) and this seems too minor to add a new API for.

Do you have a real-world example where this is an issue and that can't be solved another way?

bnoordhuis on 12 Nov 2017

UTF-8 is assumed in a lot of other places, environment variables aren't the exception.

Most places in e.g. fs also accept a Buffer/Uint8Array.

The behavior of process.env can't change because of backwards compatibility

I think the current behaviour is broken in a way no one expects. One solution might be that unrepresentable keys/values are returned as a Buffer instead of a string with unicode replacement characters?

Do you have a real-world example where this is an issue and that can't be solved another way?

This was found while working on fengari, a port of lua (including lua standard libraries) to javascript. A test found that some env vars were being read in differently in fengari to normal lua. The code path goes past here: https://github.com/fengari-lua/fengari/blob/0a8bc0b30644c514d822fb32a184318f7353a285/src/loslib.js#L180

Different use case:

Passing encryption keys (e.g. a private EC key) into a container via env var. They'd work fine 80% of the time, but once the secret (binary) key contains the wrong sequence, then you get hard to debug failures.

daurnimator on 12 Nov 2017

I think the current behaviour is broken in a way no one expects.

I don’t think this is true; It’s a single data point, but I have never seen environment variables that aren’t conceptually character strings.

One solution might be that unrepresentable keys/values are returned as a Buffer instead of a string with unicode replacement characters?

Sometimes returning a string and sometimes returning a Buffer from the getter is a non-starter, I think.

Passing encryption keys (e.g. a private EC key) into a container via env var. They'd work fine 80% of the time, but once the secret (binary) key contains the wrong sequence, then you get hard to debug failures.

80 % of the time sounds like a very optimistic estimate – A nice property of UTF-8 is that invalid UTF-8 will usually be recognizable as such. Even if you read as little as 8 bytes that are ~ uniformly random (like you’d expect from a private key) you get a 99 % chance of the result not being valid UTF-8.

Also, this is a great use case for base64.

I’m not -1 on Node supporting this, but if it does, that should happen through a separate API.

addaleax on 12 Nov 2017

I guess we could add API for that, something like process.env.get(key[, encoding]), where key can be a Buffer and encoding corresponds to encoding of the result, which is a Buffer by default.

vkurchatkin on 12 Nov 2017

👍1

process.env.get is not an option, IMO - that would break existing code that tries to read a property called 'get'.

It also makes the environment accessors slow and awkward since they have to check for these magic properties every time.

What might be acceptable is os.getenv() and os.setenv() methods but then again, it seems too minor to merit new APIs.

The binary data example is flawed and will never work; envvars are C strings, nul bytes terminate them.

bnoordhuis on 16 Nov 2017

👍3

What might be acceptable is os.getenv() and os.setenv() methods

Also need to be able to iterate env vars.

What about just pushing environ as a Buffer? And it's up to users to parse it themselves?

daurnimator on 16 Nov 2017

I'd be okay with a set of functions that work with Buffers.

This currently also leads to an issue with roundtripping right now, which is bad:

$ env -i $'FOR=abc' node -p 'process.env[Object.keys(process.env)[0]]'
abc
$ env -i $'F\xa5R=abc' node -p 'process.env[Object.keys(process.env)[0]]'
undefined

TimothyGu on 16 Nov 2017

What about just pushing environ as a Buffer? And it's up to users to parse it themselves?

I've toyed with that idea. Constructing an ArrayBuffer mapping environ and moving parsing to JS land is definitely an option, that would let us remove about 150 lines of finicky C++ code.

Some special-casing is needed for Windows because its environment is UCS-2 but otherwise it's pretty straightforward.

bnoordhuis on 17 Nov 2017

I've toyed with that idea. Constructing an ArrayBuffer mapping environ and moving parsing to JS land is definitely an option, that would let us remove about 150 lines of finicky C++ code.

Is anyone working on this? Should I start work on it? (I'll have to set up a node dev environment...)

daurnimator on 11 Jan 2018

No one's working on it, but on second thought, you also need a way to sync changes from the ArrayBuffer back to the global environ.

It's probably easier to keep using getenv() / GetEnvironmentVariableW() / etc. but expose (for example) a process.envb that's a class extends Map that only accepts Buffer and Uint8Array instances as keys and values and internally calls out to the platform-native APIs.

bnoordhuis on 11 Jan 2018

class extends Map that only accepts Buffer and Uint8Array instances as keys and values and internally calls out to the platform-native APIs.

But Maps consider different Uint8Array/Buffer keys to be different keys.
I think this is a false-equivalence that is assured to confuse....

daurnimator on 11 Jan 2018

That's a fair point.

bnoordhuis on 11 Jan 2018

I've put this into https://github.com/nodejs/node/projects/13 backlog

refack on 11 Nov 2018

Put into https://github.com/nodejs/node/projects/13 backlog

Is there some more context on this I could read?
Closing issues in a backlog seems like an odd process.

daurnimator on 11 Nov 2018

Hello @daurnimator,
It's actually a new thing I'm suggesting. As I see it we have two issues this can solve: (1) We have too many open issues (2) Among them there are feature requests like this, that have no objections, but also isn't picked up by anyone.

The idea behind the list in https://github.com/nodejs/node/projects/13 is to surface these, both the Collaborators and to members of the community who are looking to contribute. The "close" is just to separate these from bugs.

If GitHub would give us a third item type, that would have been optimal. Until then we are trying to find the best way to use the available metadata options to have visibility into what's going on.

refack on 12 Nov 2018

Yet another approach could be to follow what Python 3 does with its surrogateescape method, such as: "\xff" -> "\udcff". This has the advantage of having to neither introduce a new API nor break compatibility with existing codes. Currently node does substitute them with \uFFFD (REPLACEMENT CHARACTER), which indeed has information loss issue compared to the aforementioned method. Of course Windows uses Unicode natively for them, but then we have the option of detecting what OS the application is running on.