MNE-Python: ASCII decode error with CNT files

Created on 25 Aug 2020 · 34 comments · Source: mne-tools/mne-python

As reported on our mailing list, this CNT file seems to contain a non-ASCII character (î), which leads to a decoding error:

UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-9-8d42a9688cc0> in <module>
      1 fname = fnames[0]
----> 2 raw = mne.io.read_raw_cnt(fname)

~/anaconda3/lib/python3.7/site-packages/mne/io/cnt/cnt.py in read_raw_cnt(input_fname, eog, misc, ecg, emg, data_format, date_format, preload, verbose)
    163     return RawCNT(input_fname, eog=eog, misc=misc, ecg=ecg,
    164                   emg=emg, data_format=data_format, date_format=date_format,
--> 165                   preload=preload, verbose=verbose)
    166 
    167 

~/anaconda3/lib/python3.7/site-packages/mne/io/cnt/cnt.py in __init__(self, input_fname, eog, misc, ecg, emg, data_format, date_format, preload, verbose)
    389         input_fname = path.abspath(input_fname)
    390         info, cnt_info = _get_cnt_info(input_fname, eog, ecg, emg, misc,
--> 391                                        data_format, _date_format)
    392         last_samps = [cnt_info['n_samples'] - 1]
    393         super(RawCNT, self).__init__(

~/anaconda3/lib/python3.7/site-packages/mne/io/cnt/cnt.py in _get_cnt_info(input_fname, eog, ecg, emg, misc, data_format, date_format)
    179         patient_id = int(patient_id) if patient_id.isdigit() else 0
    180         fid.seek(121)
--> 181         patient_name = read_str(fid, 20).split()
    182         last_name = patient_name[0] if len(patient_name) > 0 else ''
    183         first_name = patient_name[-1] if len(patient_name) > 0 else ''

~/anaconda3/lib/python3.7/site-packages/mne/io/utils.py in read_str(fid, count)
    239                              b'\x00' in data else count]])
    240 
--> 241     return str(bytestr.decode('ascii'))  # Return native str type for Py2/3
    242 
    243 

UnicodeDecodeError: 'ascii' codec can't decode byte 0xee in position 0: ordinal not in range(128)

Should the function read_str default to decoding as latin-1 (ISO 8859-1) instead?

cbrnr

Most helpful comment

this is not relevant now, because we don't support ANT CNT files.

Maybe not for this one file. But as @palday points out above:

If the encoding isn't 8859, then you can decode to the wrong character.

So if we do make an exception for some non-standards-compliant files, what is the justification for picking 8859 in particular as the allowed exception (and risking that files with some other encoding are thus errorful?)

drammock on 25 Aug 2020

All 34 comments

Without the CNT file (or standard) specifying its encoding you can't know if this should be interpreted as latin-1, latin-9, utf-8, etc.

palday on 25 Aug 2020

Yes, we don't know that for many file formats. EDF also doesn't specify the encoding, but because we have encountered many files using non-ASCII characters in the wild we're just guessing that people are using 8859 (which is probably a pretty good guess).

cbrnr on 25 Aug 2020

The EDF standard actually specifies ASCII...

Isn't there a decode option to replace illegal or unidentified bytes with a filler char? That would seem to be the most agnostic solution (at least for single-byte encodings).

palday on 25 Aug 2020

You're right, it does specify ASCII. Nevertheless, people use non-ASCII strings all the time, and so far 8859 has worked fine (since 8859 is a superset of ASCII, standard-compliant EDF files are still decoded correctly, but with the added benefit that we can decode many real-world files as well). And yes, the decode method does have an errors parameter, which we could set to 'ignore' (then all non-ASCII characters are missing) or 'replace' (then all non-ASCII characters will be replaced by �). Personally, I find that assuming 8859 is more practical.

cbrnr on 25 Aug 2020
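
To make the options in the comment above concrete, here is a minimal sketch of how the `errors` parameter of `bytes.decode` behaves compared with guessing Latin-1. The input bytes are a made-up example (0xEE followed by ASCII text), not taken from the reported file.

```python
# Hypothetical header bytes: 0xEE followed by the ASCII text "les".
raw = b"\xeeles"

# Strict ASCII decoding (the default) raises on the first byte >= 0x80.
strict_failed = False
try:
    raw.decode("ascii")
except UnicodeDecodeError:
    strict_failed = True

ignored = raw.decode("ascii", errors="ignore")    # drops the offending byte
replaced = raw.decode("ascii", errors="replace")  # substitutes U+FFFD
guessed = raw.decode("latin-1")                   # interprets 0xEE as 'î'

print(strict_failed)   # True
print(repr(ignored))   # 'les'
print(repr(replaced))  # '\ufffdles'
print(repr(guessed))   # 'îles'
```

With `'replace'` the result visibly signals that something could not be decoded, while the Latin-1 guess yields readable text that is only correct if the file really was written as Latin-1.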

I'm just very strict on encoding stuff -- not being strict in the past is why we have such messes today. Moreover, I work with enough encodings to really dislike assuming anything about something which clearly isn't ASCII. For me, the advantage of replace is that it's obvious an exceptional condition has been encountered and so the errors don't pass silently (which is in line with the Zen of Python).

palday on 25 Aug 2020

In general I agree with this philosophy, but sometimes practicality beats purity - which in this case trumps not letting errors pass silently. Again, that's just my personal opinion, hopefully others will also chime in (e.g. @agramfort, @larsoner, @hoechenberger).

cbrnr on 25 Aug 2020

Also, what do we lose when we assume 8859 instead of being strict and replacing illegal characters?

For ASCII text, both options produce identical results. For non-ASCII text, assuming 8859 will produce one of the extended characters, whereas we will have � characters otherwise. I've never encountered files with non-8859 characters, but in that case I'd be strict and set errors='replace'.

cbrnr on 25 Aug 2020

If the encoding isn't 8859, then you can decode to the wrong character. For example, Latin 1 and Latin 9 differ in their codepoint-to-character mappings in a few places. More generally: if you're using another non-ASCII encoding that is ASCII-compatible in the lower 7 bits, then you'll be mapped to the wrong character. Without knowing the encoding, you simply don't know how to interpret a byte. You can guess based on the statistics of the byte distribution which language and encoding are being used, but you can't know.

palday on 25 Aug 2020
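
The Latin 1 versus Latin 9 point can be checked directly with Python's built-in codecs. This is only an illustrative sketch; the byte 0xA4 is chosen as a well-known example where the two encodings disagree.

```python
# The same byte decodes to different characters depending on the assumed encoding.
raw = b"\xa4"
as_latin1 = raw.decode("iso8859-1")   # currency sign
as_latin9 = raw.decode("iso8859-15")  # euro sign

# Collect every byte value on which the two codecs disagree.
differing = [
    value
    for value in range(256)
    if bytes([value]).decode("iso8859-1") != bytes([value]).decode("iso8859-15")
]

print(repr(as_latin1), repr(as_latin9))
print([hex(value) for value in differing])
```

Neither decode raises an error, so nothing in the program's behavior reveals which interpretation was intended.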

I tried manually decoding bytestr but without success. Do you have an encoding that works?

agramfort on 25 Aug 2020

OK, it's just a single-byte encoding, and all 256 values are defined as some character. I'd still use 8859-1 as a best guess.

@agramfort here's an example:

>>> b"\xee".decode("8859")  # 'î'
>>> b"\xee".decode("ascii")  # UnicodeDecodeError

cbrnr on 25 Aug 2020

here is the bytestr:

bytestr
b'\xeeR@\xfd\xd8\x05\n\x0c\xe9\xe6X\x1d\x80\xba\x89\xab\xf1y\xfd\xf7'

I cannot find an encoding that works globally

agramfort on 25 Aug 2020

latin-1 works for your bytestring:

>>> b'\xeeR@\xfd\xd8\x05\n\x0c\xe9\xe6X\x1d\x80\xba\x89\xab\xf1y\xfd\xf7'.decode("latin-1")
'îR@ýØ\x05\n\x0céæX\x1d\x80º\x89«ñyý÷'

cbrnr on 25 Aug 2020

But returns utter garbage! That's clearly not valid text.

palday on 25 Aug 2020
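
Because Latin-1 assigns a character to every byte value, a successful decode says nothing about whether the bytes were text in the first place. A rough plausibility check is sketched below; `looks_like_text` is a hypothetical helper (not part of MNE), and rejecting all control characters is an assumed heuristic for short header fields.

```python
import unicodedata


def looks_like_text(raw: bytes) -> bool:
    """Hypothetical heuristic: decode as Latin-1 and reject control characters."""
    # Keep only the part before the first NUL, as fixed-width header fields
    # are typically NUL-padded. Decoding as Latin-1 never raises.
    text = raw.split(b"\x00", 1)[0].decode("latin-1")
    # Category 'Cc' covers the C0 and C1 control characters.
    return all(unicodedata.category(ch) != "Cc" for ch in text)


# The byte string posted in this thread, and a made-up well-formed field.
field = b'\xeeR@\xfd\xd8\x05\n\x0c\xe9\xe6X\x1d\x80\xba\x89\xab\xf1y\xfd\xf7'
print(looks_like_text(field))                  # False
print(looks_like_text(b"Smith John\x00\x00"))  # True
```

A check like this cannot identify the encoding; it can only flag fields that are unlikely to be text under any single-byte interpretation.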

But returns utter garbage! That's clearly not valid text.

I guess that @agramfort just entered arbitrary characters - this sure ain't French, or is it 🤣 ?

cbrnr on 25 Aug 2020

shall we switch to latin-1 as latin-1 should behave like ascii for ascii valid char no?

agramfort on 25 Aug 2020

shall we switch to latin-1 as latin-1 should behave like ascii for ascii valid char no?

This would be my preferred option, but we could also decode ASCII and replace invalid characters with �.

cbrnr on 25 Aug 2020

Why latin-1 instead of utf-8?

Teekuningas on 25 Aug 2020

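Because a byte that is a complete character in Latin-1 is not necessarily valid UTF-8. For example, 0xEE is 'î' in Latin-1 but only the lead byte of a multi-byte sequence in UTF-8: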

>>> b'\xee'.decode('latin-1')
'î'
>>> b'\xee'.decode('utf8')
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xee in position 0: unexpected end of data

cbrnr on 25 Aug 2020
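
The reverse mismatch is also worth seeing: text that was written as UTF-8 decodes without error as Latin-1, but to the wrong characters. A small sketch using only built-in codecs:

```python
# The same character has different byte representations in the two encodings.
utf8_bytes = "î".encode("utf-8")      # two bytes
latin1_bytes = "î".encode("latin-1")  # one byte

# Reading UTF-8 bytes as Latin-1 succeeds silently and produces mojibake.
misread = utf8_bytes.decode("latin-1")

print(utf8_bytes)     # b'\xc3\xae'
print(latin1_bytes)   # b'\xee'
print(repr(misread))  # 'Ã®'
```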

I tried what you suggest but all the strings in this file are broken. Even channel names. mne starts to complain as all channels have the same name '��' ...

agramfort on 25 Aug 2020

True - I haven't even looked at the file. It seems like this file is not a valid CNT recording. It's likely that some other recording device produces CNT files (and we're just supporting Neuroscan devices).

cbrnr on 25 Aug 2020

Yep, this seems to be an ANT Neuro file. I don't think that we have a reader for this format, do we?

cbrnr on 25 Aug 2020

Cf #3609

cbrnr on 25 Aug 2020

lol :)

Yes, there are 2 CNT formats (Neuroscan and ANT). We do not yet support ANT files.

agramfort on 25 Aug 2020

I'm coming out of vacation mode briefly to cast a strong vote in line with @palday. If the standard says the file should be ASCII then we should support only ASCII and not try to guess the particular flavor of wrongness in a file. The point of such standards is to save us from all this error-prone extra effort.

drammock on 25 Aug 2020

@drammock this is not relevant now, because we don't support ANT CNT files.

However, if you really want to go ahead and remove all latin-1 decodings from the BrainVision, EDF, and NIRX readers, feel free to go ahead, but I'll assign all reports of people not being able to load their files to you 😉. More seriously, I'd say we keep things as they are, because we don't break anything and people are able to load files that are not standard EDF. Again, I absolutely hear you, but it doesn't really hurt to use latin1 instead of ascii in this case, because standard-compliant files will yield identical results.

cbrnr on 25 Aug 2020

Closing: the linked CNT file doesn't work because we don't support ANT files. Feel free to revisit #3609 if anyone feels like implementing support for this format.

cbrnr on 25 Aug 2020

Since I'm the one who wrote the Latin-1 support for the BV reader: note that the BV format can declare its codepage and previously used Windows system defaults for INI files, so historically Latin-1 and now UTF-8 in Western Europe (because the VHDR file is essentially an INI file). Also, the error handling code there adds ~50 lines. That's somewhat justified because µ can occur in the resolution field and provides useful info, but if it were just annotations, I would be all for just stripping it out and defaulting to replace.

palday on 25 Aug 2020

Unfortunately the file is indeed an ANT file ... I followed #3609: using read_raw_antcnt from https://github.com/behinger/mne_tools I managed to open the ANT ".cnt" file. I had to use a Python 2.7 kernel. Thanks a lot for your help!

JacquesPesnot on 25 Aug 2020

Good that you pointed out this Python code. Then maybe it's easy to add support for ANT data.

@behinger would you be OK to help us here?

We need to see if we have some license issue (needs to be BSD compatible).

agramfort on 25 Aug 2020

this is somewhat discussed in #3609. The problem is that there is currently no pure-python library. The libeep library is LGPL, but afaik external code is not possible to ship along with mne, correct?

What I can offer is to ask at ANT/eemagine whether they have a pure python importer / library we could use

behinger on 25 Aug 2020

arfff it's compiled code.... then it's still quite some work...

agramfort on 25 Aug 2020

@behinger @agramfort a long time ago, I contributed to libeep ... and it's a lot of "fun", even by the standards of compiled code. If they don't have a suitably licensed Python reader available, then an in-house two-person clean-room implementation is probably the way to go.

palday on 25 Aug 2020

@drammock I'm not preventing you from submitting a PR that removes all latin-1 decodes from our EDF reader. I'm just saying that based on real-world examples it allows people to load their EDF files even if they are not 100% compliant with the EDF standard. Of course asking why we picked iso-8859-1 over say iso-8859-2 or iso-8859-15 is a valid question, we just found that iso-8859-1 solved the problems that people specifically reported.

A middle ground could be:

try:
    text = s.decode('ascii')
except UnicodeDecodeError:
    # Fall back to ISO 8859-1, which assigns a character to every byte value
    text = s.decode('iso-8859-1')
    warn('Non-ASCII characters found in EDF header. Using 8859-1 decoding instead, be warned that this is an ad-hoc fix which is not compatible with the EDF standard.')

cbrnr on 25 Aug 2020
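
For readers who want to try the middle ground outside of MNE, here is a self-contained sketch. `read_fixed_str` is a hypothetical helper written for this illustration (it is not MNE's `read_str`), it uses the standard-library `warnings` module in place of MNE's own `warn`, and the sample bytes are made up.

```python
import io
import warnings


def read_fixed_str(fid, count):
    """Read `count` bytes, cut at the first NUL, and decode ASCII-first.

    Hypothetical helper: falls back to ISO 8859-1 with a warning when the
    field contains non-ASCII bytes.
    """
    raw = fid.read(count).split(b"\x00", 1)[0]
    try:
        return raw.decode("ascii")
    except UnicodeDecodeError:
        warnings.warn(
            "Non-ASCII bytes in header field; decoding as ISO 8859-1, "
            "which is only a guess.",
            RuntimeWarning,
        )
        return raw.decode("iso-8859-1")


# A made-up 8-byte, NUL-padded field that starts with 0xEE.
header = io.BytesIO(b"\xeeles\x00\x00\x00\x00")
print(read_fixed_str(header, 8))  # îles (and a RuntimeWarning is emitted)
```

Standard-compliant fields take the ASCII path and produce no warning, so the fallback only becomes visible for files that deviate from the standard.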
