As reported on our mailing list, this CNT file seems to contain a non-ASCII character (î), which leads to a decoding error:
UnicodeDecodeError Traceback (most recent call last)
<ipython-input-9-8d42a9688cc0> in <module>
1 fname = fnames[0]
----> 2 raw = mne.io.read_raw_cnt(fname)
~/anaconda3/lib/python3.7/site-packages/mne/io/cnt/cnt.py in read_raw_cnt(input_fname, eog, misc, ecg, emg, data_format, date_format, preload, verbose)
163 return RawCNT(input_fname, eog=eog, misc=misc, ecg=ecg,
164 emg=emg, data_format=data_format, date_format=date_format,
--> 165 preload=preload, verbose=verbose)
166
167
~/anaconda3/lib/python3.7/site-packages/mne/io/cnt/cnt.py in __init__(self, input_fname, eog, misc, ecg, emg, data_format, date_format, preload, verbose)
389 input_fname = path.abspath(input_fname)
390 info, cnt_info = _get_cnt_info(input_fname, eog, ecg, emg, misc,
--> 391 data_format, _date_format)
392 last_samps = [cnt_info['n_samples'] - 1]
393 super(RawCNT, self).__init__(
~/anaconda3/lib/python3.7/site-packages/mne/io/cnt/cnt.py in _get_cnt_info(input_fname, eog, ecg, emg, misc, data_format, date_format)
179 patient_id = int(patient_id) if patient_id.isdigit() else 0
180 fid.seek(121)
--> 181 patient_name = read_str(fid, 20).split()
182 last_name = patient_name[0] if len(patient_name) > 0 else ''
183 first_name = patient_name[-1] if len(patient_name) > 0 else ''
~/anaconda3/lib/python3.7/site-packages/mne/io/utils.py in read_str(fid, count)
239 b'\x00' in data else count]])
240
--> 241 return str(bytestr.decode('ascii')) # Return native str type for Py2/3
242
243
UnicodeDecodeError: 'ascii' codec can't decode byte 0xee in position 0: ordinal not in range(128)
Should the function read_str default to decoding as Latin-1 (ISO 8859-1) instead?
Without the CNT file (or standard) specifying its encoding you can't know if this should be interpreted as latin-1, latin-9, utf-8, etc.
Yes, we don't know that for many file formats. EDF also doesn't specify the encoding, but because we have encountered many files using non-ASCII characters in the wild we're just guessing that people are using 8859 (which is probably a pretty good guess).
The EDF standard actually specifies ASCII...
Isn't there a decode option to replace illegal or unidentified bytes with a filler character? That would seem to be the most agnostic solution (at least for single-byte encodings).
You're right, it does specify ASCII. Nevertheless, people use non-ASCII strings all the time, and so far 8859 has worked fine (since 8859 is a superset of ASCII, standard-compliant EDF files are still decoded correctly, but with the added benefit that we can decode many real-world files as well). And yes, the decode method does have an errors parameter, which we could set to 'ignore' (then all non-ASCII characters are missing) or 'replace' (then all non-ASCII characters will be replaced by �). Personally, I find that assuming 8859 is more practical.
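To make the three options concrete, here is a minimal sketch using only Python's built-in `bytes.decode`. The sample bytes are made up for illustration (0xee followed by two ASCII letters) and are not taken from the file in question.

```python
# Hypothetical header field: one non-ASCII byte (0xee) followed by plain ASCII.
raw = b"\xeele"

as_latin1 = raw.decode("latin-1")                    # assume ISO 8859-1
as_ignored = raw.decode("ascii", errors="ignore")    # silently drop non-ASCII bytes
as_replaced = raw.decode("ascii", errors="replace")  # substitute U+FFFD for each one

print(repr(as_latin1))    # 'île'
print(repr(as_ignored))   # 'le'
print(repr(as_replaced))  # '\ufffdle', shown as '�le'
```

With 'replace', the reader can at least see that something was lost; with 'ignore', the string looks clean but is silently shorter.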
I'm just very strict on encoding stuff -- not being strict in the past is why we have such messes today. Moreover, I work with enough encodings to really dislike assuming anything about something which clearly isn't ASCII. For me, the advantage for replace is that it's obvious an exceptional condition has been encountered and so the errors don't pass silently (which is in line with Zen of Python).
In general I agree with this philosophy, but sometimes practicality beats purity - which in this case trumps not letting errors pass silently. Again, that's just my personal opinion, hopefully others will also chime in (e.g. @agramfort, @larsoner, @hoechenberger).
Also, what do we lose when we assume 8859 instead of being strict and replacing illegal characters?
For ASCII text, both options produce identical results. For non-ASCII text, assuming 8859 will produce one of the extended characters, whereas we will have � characters otherwise. I've never encountered files with non-8859 characters, but in that case I'd be strict and set errors='replace'.
If the encoding isn't 8859, then you can decode to the wrong character. For example, Latin 1 and Latin 9 differ in their codepoint-to-character mappings in a few places. More generally: if you're using another non-ASCII encoding that is ASCII-compatible in the lower 7 bits, then bytes above 0x7F will be mapped to the wrong character. Without knowing the encoding, you simply don't know how to interpret a byte. You can guess based on the statistics of the byte distribution which language and encoding are being used, but you can't know.
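As a small illustration of that point, the byte 0xa4 is one of the places where Latin 1 (ISO 8859-1) and Latin 9 (ISO 8859-15) disagree. The sketch below uses only standard Python codecs; the byte values are chosen for the example and do not come from any real recording.

```python
# 0xa4 is one of the few bytes where ISO 8859-1 and ISO 8859-15 disagree.
ambiguous = b"\xa4"

latin1 = ambiguous.decode("iso-8859-1")   # '¤' (currency sign, U+00A4)
latin9 = ambiguous.decode("iso-8859-15")  # '€' (euro sign, U+20AC)

# Pure ASCII bytes decode identically under both codecs.
ascii_only = b"Fp1"
same = ascii_only.decode("iso-8859-1") == ascii_only.decode("iso-8859-15")

print(repr(latin1), repr(latin9), same)
```

Both calls succeed without any error, so nothing at decode time tells you which interpretation was intended.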
I tried manually decoding bytestr but without success. Do you have an encoding that works?
OK, it's just a single-byte encoding, and all 256 values are defined as some character. I'd still use 8859-1 as a best guess.
@agramfort here's an example:
>>> b"\xee".decode("8859") # 'î'
>>> b"\xee".decode("ascii") # UnicodeDecodeError
here is the bytestr:
bytestr
b'\xeeR@\xfd\xd8\x05\n\x0c\xe9\xe6X\x1d\x80\xba\x89\xab\xf1y\xfd\xf7'
I cannot find an encoding that works globally
latin-1 works for your bytestring:
>>> b'\xeeR@\xfd\xd8\x05\n\x0c\xe9\xe6X\x1d\x80\xba\x89\xab\xf1y\xfd\xf7'.decode("latin-1")
'îR@ýØ\x05\n\x0céæX\x1d\x80º\x89«ñyý÷'
But returns utter garbage! That's clearly not valid text.
I guess that @agramfort just entered arbitrary characters - this sure ain't French, or is it 🤣 ?
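Since Latin-1 assigns a character to every one of the 256 byte values, a successful decode is no evidence that the bytes are text. A rough sanity check is sketched below; `looks_like_text` and its 0.9 threshold are hypothetical names and values invented for this example, not part of MNE.

```python
def looks_like_text(data: bytes, threshold: float = 0.9) -> bool:
    """Hypothetical heuristic: True if most decoded characters are printable."""
    if not data:
        return True
    decoded = data.decode("latin-1")  # never raises: all 256 byte values map
    printable = sum(ch.isprintable() for ch in decoded)
    return printable / len(decoded) >= threshold


# The byte string posted above, and a plausible Latin-1 encoded name.
garbage = b'\xeeR@\xfd\xd8\x05\n\x0c\xe9\xe6X\x1d\x80\xba\x89\xab\xf1y\xfd\xf7'
name = "Müller".encode("latin-1")

print(looks_like_text(garbage))  # False: 6 of 20 characters are control codes
print(looks_like_text(name))     # True
```

A check like this cannot identify the encoding; it can only flag fields that are probably not text at all, which is what turned out to be the case here.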
Shall we switch to latin-1? It should behave like ASCII for valid ASCII characters, no?
This would be my preferred option, but we could also decode ASCII and replace invalid characters with �.
Why latin-1 instead of utf-8?
Because not every Latin-1 byte sequence is valid UTF-8: a lone byte above 0x7F cannot be decoded as UTF-8, e.g.
>>> b'\xee'.decode('latin-1')
'î'
>>> b'\xee'.decode('utf8')
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xee in position 0: unexpected end of data
I tried what you suggested, but all the strings in this file are broken, even the channel names. MNE starts to complain because all channels have the same name '�����' ...
True - I haven't even looked at the file. It seems like this file is not a valid CNT recording. It's likely that some other recording device produces CNT files (and we're just supporting Neuroscan devices).
Yep, this seems to be an ANT Neuro file. I don't think that we have a reader for this format, do we?
Cf #3609
lol :)
Yes, there are two CNT formats (Neuroscan and ANT). We do not yet support ANT files.
I'm coming out of vacation mode briefly to cast a strong vote in line with @palday. If the standard says the file should be ASCII then we should support only ASCII and not try to guess the particular flavor of wrongness in a file. The point of such standards is to save us from all this error-prone extra effort.
@drammock this is not relevant now, because we don't support ANT CNT files.
However, if you really want to go ahead and remove all latin-1 decodings from the BrainVision, EDF, and NIRX readers, feel free to go ahead, but I'll assign all reports of people not being able to load their files to you 😉. More seriously, I'd say we keep things as they are, because we don't break anything and people are able to load files that are not standard EDF. Again, I absolutely hear you, but it doesn't really hurt to use latin1 instead of ascii in this case, because standard-compliant files will yield identical results.
Closing: the linked CNT file doesn't load because we don't support ANT files. Feel free to revisit #3609 if anyone feels like implementing support for this format.
Since I'm the one who wrote the Latin-1 support for the BV reader: note that the BV format can declare its codepage and previously used Windows system defaults for INI files, so historically Latin-1 and now UTF-8 in Western Europe (because the VHDR file is essentially an INI file). Also, the error handling code there adds ~50 lines. That's somewhat justified because µ can occur in the resolution field and provides useful info, but if it were just annotations, I would be all for just stripping it out and defaulting to replace.
Unfortunately the file is indeed an ANT file ... I followed #3609: using read_raw_antcnt from https://github.com/behinger/mne_tools I managed to open the ANT ".cnt" file. I had to use a python 2.7 kernel. Thanks a lot for your help!
Good that you pointed out this Python code. Then maybe it's easy to add support for ANT data.
@behinger would you be OK to help us here?
We need to check whether there is a license issue (the code needs to be BSD-compatible).
This is somewhat discussed in #3609. The problem is that there is currently no pure-Python library. The libeep library is LGPL, but AFAIK it is not possible to ship external code along with MNE, correct?
What I can offer is to ask at ANT/eemagine whether they have a pure python importer / library we could use
"this is not relevant now, because we don't support ANT CNT files."
Maybe not for this one file. But as @palday points out above:
"If the encoding isn't 8859, then you can decode to the wrong character."
So if we do make an exception for some non-standards-compliant files, what is the justification for picking 8859 in particular as the allowed exception (and risking that files with some other encoding are decoded incorrectly)?
arfff it's compiled code... then it's still quite some work...
@behinger @agramfort a long time ago, I contributed to libeep ... and it's a lot of "fun", even by the standards of compiled code. If they don't have a suitably licensed Python reader available, then an in-house two-person clean-room implementation is probably the way to go.
@drammock I'm not preventing you from submitting a PR that removes all latin-1 decodes from our EDF reader. I'm just saying that based on real-world examples it allows people to load their EDF files even if they are not 100% compliant with the EDF standard. Of course, asking why we picked iso-8859-1 over, say, iso-8859-2 or iso-8859-15 is a valid question; we just found that iso-8859-1 solved the problems that people specifically reported.
A middle ground could be:
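    from warnings import warn

    try:
        text = s.decode('ascii')  # standard-compliant files take this path
    except UnicodeDecodeError:
        # Fall back to Latin-1 and tell the user that we are guessing.
        text = s.decode('iso-8859-1')
        warn('Non-ASCII characters found in EDF header. Using 8859-1 decoding instead; be warned that this is an ad hoc fix which is not compatible with the EDF standard.')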