Hosts: UnicodeEncodeError: 'charmap' codec can't encode character '\ufeff' in position 3: character maps to <undefined>

Created on 12 Jun 2020 · 8 Comments · Source: StevenBlack/hosts

Hi all 👋

I'm getting this error while running the script and updating the sources. I'm on Windows 10 with Python 3.8.3.

Traceback (most recent call last):
  File "updateHostsFile.py", line 1750, in <module>
    main()
  File "updateHostsFile.py", line 282, in main
    final_file = remove_dups_and_excl(merge_file, exclusion_regexes)
  File "updateHostsFile.py", line 937, in remove_dups_and_excl
    hostname, normalized_rule = normalize_rule(
  File "updateHostsFile.py", line 1025, in normalize_rule
    print("==>%s<==" % rule)
  File "C:\Python38\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\ufeff' in position 3: character maps to <undefined>

Most helpful comment

Assuming that we are talking about the cp1252 encoding as mentioned in:

File "C:\Python38\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]

I can't (literally) reproduce.

$ # Change the Python encoding to CP1252 through the `PYTHONIOENCODING` environment variable.
$  export PYTHONIOENCODING="cp1252"
$ # Start the generation.
$ python updateHostsFile.py -a
[truncated]
==>fe00::0 ip6-localnet<==
==>ff00::0 ip6-mcastprefix<==
==>ff02::2 ip6-allrouters<==
==>ff02::3 ip6-allhosts<==
Success! The hosts file has been saved in folder 
It contains 57,286 unique entries.

Therefore, I don't know where the problem is here. Unless OP can give us more information, I'm not going to look for a problem which may not exist.


Other info

Python version

$ python -VV
Python 3.8.3 (default, May 17 2020, 18:15:42) 
[GCC 10.1.0]

Why use the PYTHONIOENCODING environment variable?

Because the error is raised by print(), I should be able to reproduce it by changing the default stdout encoding.

File "updateHostsFile.py", line 1025, in normalize_rule
    print("==>%s<==" % rule)

Here is an example showing that the variable really does change the stdout encoding.

$ export PYTHONIOENCODING="utf-8"
$ python
Python 3.8.3 (default, May 17 2020, 18:15:42) 
[GCC 10.1.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.stdout.encoding
'utf-8'
>>> print(u'\xe9')
é
$ export PYTHONIOENCODING="cp1252"
$ python
Python 3.8.3 (default, May 17 2020, 18:15:42) 
[GCC 10.1.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.stdout.encoding
'cp1252'
>>> print('\xe9')
�

Now what about \ufeff?

I have never worked with it, but it is well explained here.

So I tried, with PYTHONIOENCODING (again).

With CP1252

>>> print('\ufeff')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.8/encodings/cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\ufeff' in position 0: character maps to <undefined>

With UTF-8

>>> print('\ufeff')

>>> 
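
For what it's worth, the same comparison can be sketched without touching PYTHONIOENCODING, by printing to an in-memory stream with an explicit encoding. This is only an illustration: `try_print` and the sample rule are made up for the example and are not part of updateHostsFile.py.

```python
import io


def try_print(text, encoding):
    """Print text to an in-memory stream that uses the given encoding."""
    stream = io.TextIOWrapper(io.BytesIO(), encoding=encoding)
    try:
        print(text, file=stream)
        stream.flush()
        return "ok"
    except UnicodeEncodeError as exc:
        # exc.start is the index of the first character that could not be encoded.
        return "failed at position %d" % exc.start


# Hypothetical rule that starts with U+FEFF, formatted like the debug print.
line = "==>%s<==" % "\ufeff0.0.0.0 example.com"

result_cp1252 = try_print(line, "cp1252")
result_utf8 = try_print(line, "utf-8")
print(result_cp1252)
print(result_utf8)
```

Since "==>" is three characters long, a failure at position 3 (as in OP's traceback) would mean that \ufeff is the very first character of the rule being printed.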

As for this project itself, I really don't know where \ufeff comes from, as the line:

    print("==>%s<==" % rule)

is generated at the end... And I really can't find anything about this.
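
One possible origin, and this is only a guess that nothing in this thread confirms, is a source file that starts with a UTF-8 byte order mark (BOM). Decoding such bytes as "utf-8" keeps the BOM as '\ufeff', while "utf-8-sig" drops it. The sample bytes below are invented for the illustration.

```python
import codecs

# Hypothetical content of a downloaded source file that starts with a UTF-8 BOM.
raw = codecs.BOM_UTF8 + b"0.0.0.0 example.com\n"

kept = raw.decode("utf-8")          # the BOM survives as '\ufeff'
stripped = raw.decode("utf-8-sig")  # the BOM is removed while decoding

# str.strip() does not remove '\ufeff', because it is not whitespace.
line = "==>%s<==" % kept.strip()
bom_index = line.index("\ufeff")

print(repr(kept[:8]))
print(repr(stripped[:8]))
print(bom_index)
```

If that guess is right, the BOM would end up at index 3 of the printed string, right after "==>", which matches the position reported in OP's traceback.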


@StevenBlack @XhmikosR I leave the rest for you!

All 8 comments

Works fine here:

C:\Users\xmr\Desktop\hosts>ver

Microsoft Windows [Version 10.0.19041.329]

C:\Users\xmr\Desktop\hosts>python --version
Python 3.8.3

C:\Users\xmr\Desktop\hosts>python updateHostsFile.py
Do you want to update all data sources? [Y/n] n
OK, we'll stick with what we've got locally.
Do you want to exclude any domains?
For example, hulu.com video streaming must be able to access its tracking and ad servers in order to play video. [Y/n] n
OK, we'll only exclude domains in the whitelist.
==>fe00::0 ip6-localnet<==
==>ff00::0 ip6-mcastprefix<==
==>ff02::2 ip6-allrouters<==
==>ff02::3 ip6-allhosts<==
Success! The hosts file has been saved in folder
It contains 57,460 unique entries.
Do you want to replace your existing hosts file with the newly generated file? [Y/n] n

What's your system config and the exact command you are using to run the script? Also, I assume you are on the latest master?

C:\Users\xmr\Desktop>@systeminfo | @findstr /B /C:"OS Name" /B /C:"OS Version" /B /C:"System Locale" /B /C:"Input Locale"
OS Name:                   Microsoft Windows 10 Pro
OS Version:                10.0.19041 N/A Build 19041
System Locale:             en-us;English (United States)
Input Locale:              en-us;English (United States)

Closing.

@StevenBlack do note that this is probably valid and we should be using encoding="utf-8" in more places. It just happens with specific locales, probably.
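
A rough sketch of what that could look like, as an assumption rather than the project's actual code: files can be opened with the built-in `open(path, encoding="utf-8")`, and the debug print can be made tolerant of consoles that cannot represent a character. `safe_print` below is a hypothetical helper, not something that exists in updateHostsFile.py.

```python
import io
import sys


def safe_print(text, stream=None):
    """Print text, escaping characters the stream's encoding cannot represent."""
    if stream is None:
        stream = sys.stdout
    encoding = getattr(stream, "encoding", None) or "utf-8"
    # Unencodable characters become escapes such as \ufeff instead of raising.
    printable = text.encode(encoding, errors="backslashreplace").decode(encoding)
    print(printable, file=stream)
    return printable


# Usage with an in-memory cp1252 stream standing in for a Windows console.
fake_console = io.TextIOWrapper(io.BytesIO(), encoding="cp1252")
shown = safe_print("==>\ufeff0.0.0.0 example.com<==", fake_console)
fake_console.flush()
```

The trade-off is that the output shows the escape sequence rather than the character itself, which for a debug line is arguably more useful than a crash.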

@funilrys FYI

@XhmikosR I closed this because OP appears unresponsive...

@XhmikosR if I don't update the data sources it works fine, as you did:

Do you want to update all data sources? [Y/n] n

Does it still work for you when you update the data sources?

You still fail to give the requested info though. And yeah, it works fine here.

