Newman: Newman JSON parser cannot handle files (environment, possibly test collection) having byte-order marker (Unicode data)

Created on 28 Jan 2019 · 11 Comments · Source: postmanlabs/newman

We're attempting to automate data-driven API testing by generating a Postman environment file including query output for use with Newman.

We're using the standard Windows Powershell cmdlets for handling JSON data and text files to do this, and they work well except: without specifying an encoding, Out-File encodes using UTF16-LE _with byte order marker_. Newman chokes on the BOM, emitting this error:

Unexpected token � in JSON at position 0 while parsing near '��{...

The same environment file imports successfully into Postman.
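A minimal Node.js sketch of the failure described above. It assumes the file bytes are decoded as UTF-8 before being handed to `JSON.parse`; the thread does not confirm that this is exactly what Newman 4.3.1 does. The sample JSON and variable names are illustrative only.

```javascript
// Bytes of a UTF-16LE file with a BOM (FF FE), which is what Windows
// PowerShell's Out-File writes when no encoding is specified.
const json = '{"id":"env"}';
const fileBytes = Buffer.concat([
  Buffer.from([0xff, 0xfe]),      // UTF-16LE byte-order mark
  Buffer.from(json, 'utf16le')    // payload, two bytes per character
]);

// Assumption: the reader decodes the bytes as UTF-8.
// FF and FE are never valid in UTF-8, so each becomes U+FFFD.
const misread = fileBytes.toString('utf8');

let parseError = null;
try {
  JSON.parse(misread);
} catch (err) {
  parseError = err;               // SyntaxError raised at position 0
}

console.log(JSON.stringify(misread.slice(0, 3)));
console.log(parseError ? parseError.name : 'parsed without error');
```

The two replacement characters followed by `{` line up with the `'��{` fragment in the error message.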

Explicitly specifying ASCII encoding when generating the environment file allows it to work with Newman, but there is the possibility the test data will contain international text (accented characters, oriental glyphs, etc.) and encoding to ASCII may break any test that happens to hit that data.

Please implement UTF-8 and UTF-16 (LE & BE) BOM support into the Newman .json parser, in addition to straight ASCII/ANSI.

Newman Version: 4.3.1
OS details (type, version, and architecture): Windows 10, amd64
Are you using Newman as a library, or via the CLI? CLI from PowerShell
Did you encounter this recently, or has this bug always been there: Just barely started using Newman
Expected behaviour: environment file is properly parsed, retaining Unicode characters if present
Command / script used to run Newman: newman run "test.json" -e "env.json" (filenames shortened)
Sample collection, and auxiliary files (minus the sensitive details): See attachment UTF16_environment_file.zip

Steps to reproduce the problem (probably):

1. Define an environment file and export it from Postman.
2. Change its encoding to UTF16-LE w/BOM (e.g. using Notepad++).
3. Attempt to run a Newman test using that environment file.


jhardin-aptos

All 11 comments

Yep. I haven’t tested this personally yet. But I can imagine that this could be a problem if we are doing plain simple JSON.parse

@codenirvana - I have handled this in liquid json - https://github.com/postmanlabs/liquid-json

Can we replace every JSON.parse of input files with this and add tests?

shamasis on 29 Jan 2019

Well, since we are using another module for better json error help, this may be tricky. But it seems that .trim() removes BOM too.
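A small sketch of the `.trim()` behaviour mentioned here, assuming the file has already been decoded correctly so that the BOM shows up as a single leading U+FEFF character. ECMAScript treats U+FEFF as whitespace, but JSON does not. The sample data is made up for illustration.

```javascript
// A correctly decoded string that still carries a leading BOM (U+FEFF).
const withBom = '\uFEFF{"name":"café 测试"}';

// JSON.parse rejects the BOM: it is not JSON whitespace.
let rejected = false;
try {
  JSON.parse(withBom);
} catch (err) {
  rejected = err instanceof SyntaxError;
}

// String.prototype.trim() strips U+FEFF because ECMAScript counts it as
// whitespace. Only the leading and trailing edges of the string are touched.
const parsed = JSON.parse(withBom.trim());

console.log(rejected);
console.log(parsed.name);
```

Note that this only helps when the decoder already matched the file's encoding; it does nothing for a UTF-16 file that was read as UTF-8.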

shamasis on 29 Jan 2019

That's not surprising as the BOM is actually an encoded optional-whitespace character...

I hesitate to jump on the .trim() bandwagon absent testing to ensure that doing so does not lead to corruption of any Unicode data that may be in the file...

I haven't actually gotten test data with that yet (I'm still working on the basic data-driven testing framework itself), but I'd like to ask that you ensure your test suite for this has a file containing non-Latin Unicode data (like Chinese characters) to verify it continues to be handled properly. I don't see anything like that in your commit.

Thanks!

jhardin-aptos on 29 Jan 2019

Yeah. Test suite has Unicode characters. And that did not fail.

shamasis on 30 Jan 2019


@jhardin-aptos https://github.com/postmanlabs/newman/pull/1874 should fix your issue. Can you check that out? Clone the repo, switch to that branch, and then run npm install with that directory as the source. If that's too much work ;-) then we will keep this issue updated as the release happens.

shamasis on 30 Jan 2019

At the moment just forcing ASCII encoding has unblocked us. When the next Newman release happens I'll test it and we'll upgrade the QA environment. If we _do_ get any tests that fail in the meantime due to that mangling non-Latin characters I'll note the cause and we'll ignore it.

Thanks!

jhardin-aptos on 30 Jan 2019


Alright, spoke too soon (thanks @codenirvana). The file you sent has more than the BOM; its entire encoding is UTF8-LE (how stupid of me for even thinking otherwise).

Automatic character encoding detection is tricky and using any third-party library to do that has the following two challenges:

- They need OS bindings and are OS specific.
- The detection libraries are HUUUUGE in size.

So, what makes sense is to open up options that allow users to specify what encoding should be used to read files.

I have stumbled upon some great work on a library called chardet (https://github.com/runk/node-chardet), where the tradeoff could be performance. I'm checking that out as I type.

shamasis on 31 Jan 2019

It is actually UTF16-LE.

I can see the pain in trying to detect whether a file is ANSI or UTF8... _But:_ you have a BOM, so that's not necessary. I'm a bit surprised JS isn't automagically interpreting that if you've specified you're reading from a text file.

Perhaps rather than just blindly discarding the BOM via trim(), do this: after reading the file check whether there is a BOM present (the file contents string starts with \xfeff or \xfffe), and if so then re-read the file explicitly specifying the correct encoding based on the BOM?

Alternatively: peek the first two bytes of the file to determine the encoding, then read it once with the correct encoding.
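The "peek the first bytes" idea could be sketched roughly as follows. This is not Newman's actual implementation. `decodeWithBom` is a hypothetical helper name, and falling back to UTF-8 when no BOM is present is an assumption. Node has no built-in UTF-16BE decoder, so the sketch byte-swaps a copy and decodes it as UTF-16LE.

```javascript
// Hypothetical helper: choose a decoder from the BOM, then decode once.
function decodeWithBom(buf) {
  // UTF-8 BOM: EF BB BF
  if (buf.length >= 3 && buf[0] === 0xef && buf[1] === 0xbb && buf[2] === 0xbf) {
    return buf.toString('utf8', 3);
  }
  // UTF-16LE BOM: FF FE
  if (buf.length >= 2 && buf[0] === 0xff && buf[1] === 0xfe) {
    return buf.toString('utf16le', 2);
  }
  // UTF-16BE BOM: FE FF
  if (buf.length >= 2 && buf[0] === 0xfe && buf[1] === 0xff) {
    const body = Buffer.from(buf.subarray(2)); // copy, so the input is not mutated
    if (body.length % 2 !== 0) {
      throw new Error('Truncated UTF-16BE data: odd number of bytes');
    }
    return body.swap16().toString('utf16le');
  }
  // Assumption: without a BOM, treat the bytes as UTF-8 (covers plain ASCII).
  return buf.toString('utf8');
}

// Usage with in-memory bytes; a real caller would pass fs.readFileSync(path).
const sample = '{"city":"Zürich","名":"值"}';
const utf16le = Buffer.concat([
  Buffer.from([0xff, 0xfe]),
  Buffer.from(sample, 'utf16le')
]);
console.log(JSON.parse(decodeWithBom(utf16le)).city);
```

The sketch ignores UTF-32, whose little-endian BOM (FF FE 00 00) starts with the same two bytes as UTF-16LE, so a fuller version would need to check for that case first.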

jhardin-aptos on 31 Jan 2019


I have updated the branch to do auto detection - for a limited subset of encodings.

shamasis on 4 Feb 2019

@jhardin-aptos Thanks for reporting, this has been fixed in Newman v4.4.0.
Read more about the supported file encodings: https://github.com/postmanlabs/newman#file-encoding

codenirvana on 20 Feb 2019

Just upgraded to Newman 4.4.1 and ran a test with accented data from the database saved in a UTF-16LE (standard Powershell text output) environment file and it worked correctly.

Thanks!

jhardin-aptos on 25 Mar 2019

