Pillow: file names with non-ASCII (UTF-8) characters fail to open, preventing port from OpenCV

Created on 4 Dec 2020  ·  19 Comments  ·  Source: python-pillow/Pillow

PIL appears to not handle file names with non-ASCII characters.

import PIL
from PIL import Image
s = "/xxx/xxxx/xxxx/xxx_Diözesxxxxx.jpg"
PIL.Image.open(s)

Note: xxx to hide user info.
Occurs on all file names or paths that include characters like ö.

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.6/dist-packages/PIL/Image.py", line 2891, in open
    fp = builtins.open(filename, "rb")
UnicodeEncodeError: 'ascii' codec can't encode character '\xf6' in position 138: ordinal not in range(128)

The files open successfully with OpenCV (I'm converting from OpenCV to PIL). Renaming the files and paths to use only ASCII characters makes everything work, but that is not a great solution.

  • OS: ubuntu:18.04
  • Python: sys.version_info(major=3, minor=6, micro=9, releaselevel='final', serial=0)
  • Pillow: 8.0.1

The above is inside a docker image built from

FROM ubuntu:18.04

RUN apt-get update && apt-get install -y software-properties-common && add-apt-repository ppa:deadsnakes/ppa && \
    apt-get update && apt-get install -y python3.6 python3.6-dev python3-pip

RUN ln -sfn /usr/bin/python3.6 /usr/bin/python3 && ln -sfn /usr/bin/python3 /usr/bin/python && ln -sfn /usr/bin/pip3 /usr/bin/pip

RUN pip3 install Pillow==8.0.1

WORKDIR /app
COPY . .

CMD python ./py-image-server.py

The host is Ubuntu 18.04.

All 19 comments

Strange, I haven't tried the Dockerfile but I cannot reproduce it on Mac:

Python 3.9.0 (v3.9.0:9cf6752276, Oct  5 2020, 11:29:23)
[Clang 6.0 (clang-600.0.57)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from PIL import Image
>>> s = "xxx/xxxx/xxxx/xxx_Diözesxxxxx.jpg" # note: removed initial /
>>> Image.open(s)
<PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=128x128 at 0x107F42C10>
>>>

I used https://github.com/python-pillow/Pillow/blob/master/Tests/images/hopper.jpg renamed as xxx_Diözesxxxxx.jpg.

Nor can I reproduce it on CI with Python 3.6-3.9 on Ubuntu 16.04, 18.04, 20.04, Mac or Windows.

The problem here is coming not from Pillow itself but from the stdlib's builtins.open(filename, "rb"), which has to encode the file name using an encoding derived from the locale. locale.getpreferredencoding(False) is a quick way to see what the locale reports.

What do you get for these? Is the encoding ASCII?

>>> open(s)
<_io.TextIOWrapper name='xxx/xxxx/xxxx/xxx_Diözesxxxxx.jpg' mode='r' encoding='UTF-8'>
>>>
>>> import locale
>>> locale.getpreferredencoding(False)
'UTF-8'
>>>
print(open("ro_user_files/ö.JPG"))

returns
UnicodeEncodeError: 'ascii' codec can't encode character '\xf6' in position 14: ordinal not in range(128)

locale.getpreferredencoding(False)

returns
'ANSI_X3.4-1968'
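
For readers following along: ANSI_X3.4-1968 is the formal name for ASCII, which is what the C/POSIX locale reports on glibc systems. The following is a minimal sketch for inspecting the encodings Python picked up from the environment. The helper name describe_encodings is hypothetical; the calls it makes are from the standard library.

```python
import locale
import sys


def describe_encodings():
    """Collect the encodings Python derived from the current environment."""
    return {
        # Default encoding for the *contents* of text-mode files.
        "preferred": locale.getpreferredencoding(False),
        # Encoding used to turn str file names into bytes for the OS.
        "filesystem": sys.getfilesystemencoding(),
        # Error handler paired with the filesystem encoding.
        "filesystem_errors": sys.getfilesystemencodeerrors(),
    }


if __name__ == "__main__":
    for key, value in describe_encodings().items():
        print(f"{key}: {value}")
```

If "filesystem" comes back as an ASCII variant, any str path containing a character such as ö cannot be converted to bytes for the operating system, regardless of what the file contains.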

ö
Attached is a file that does the same thing but is not a user file. It comes from Austria, and I am located in Texas.

print(open("ro_user_files/©"))

where the content of the file is "hello", also fails.
So it is not related to PIL?
But why does OpenCV open these files with UTF-8 names (the ones with JPG binary content)?

Does print(open("ro_user_files/©", encoding="utf-8")) work?

If so, could you please try editing your /usr/local/lib/python3.6/dist-packages/PIL/Image.py file to set the encoding too?

diff --git a/src/PIL/Image.py b/src/PIL/Image.py
index 8d3f6b0a..e903f802 100644
--- a/src/PIL/Image.py
+++ b/src/PIL/Image.py
@@ -2888,7 +2888,7 @@ def open(fp, mode="r", formats=None):
         filename = fp

     if filename:
-        fp = builtins.open(filename, "rb")
+        fp = builtins.open(filename, "rb", encoding="utf-8")
         exclusive_fp = True
>>> print(open("ro_user_files/©", encoding="utf-8"))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character '\xa9' in position 14: ordinal not in range(128)

Do these Docker tips help?

https://stackoverflow.com/a/58780738/724176

As per the instructions:

1) Created the file default_locale with this content:
environment=LANG="es_ES.utf8", LC_ALL="es_ES.UTF-8", LC_LANG="es_ES.UTF-8"

2) Added the following to the Dockerfile, trying it both before and after installing Python:

RUN apt-get clean && apt-get update && apt-get install -y locales
RUN locale-gen en_CA.UTF-8
COPY ./default_locale /etc/default/locale
RUN chmod 0755 /etc/default/locale
ENV LC_ALL=en_CA.UTF-8
ENV LANG=en_CA.UTF-8
ENV LANGUAGE=en_CA.UTF-8

still get

>>> from PIL import Image
>>> Image.open("/ro_user_files/ö.JPG")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.6/dist-packages/PIL/Image.py", line 2891, in open
    fp = builtins.open(filename, "rb")
UnicodeEncodeError: 'ascii' codec can't encode character '\xf6' in position 15: ordinal not in range(128)

The problem that https://stackoverflow.com/a/58780738/724176 addresses is about reading/decoding the file's contents. The issue I uncovered has to do with the file name: the file reads fine if I replace all non-ASCII characters in its name or path.
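
To make the distinction concrete, here is a small sketch that reproduces the same error message without touching any file at all. It only encodes the name, which is the step that fails under an ASCII locale. The variable names are illustrative.

```python
name = "ö.jpg"

# Encoding the *name* to ASCII fails, independent of the file's contents.
try:
    name.encode("ascii")
except UnicodeEncodeError as exc:
    message = str(exc)
    print(message)

# The same name encodes cleanly as UTF-8: "ö" becomes two bytes.
utf8_name = name.encode("utf-8")
print(utf8_name)  # b'\xc3\xb6.jpg'
```

The printed message has the same shape as the one in the tracebacks above; only the reported position differs, because it depends on where the character sits in the path.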

I got it working with Docker.

First, I could repro the problem with this based on your Dockerfile:

FROM ubuntu:18.04

RUN apt-get update && apt-get install -y software-properties-common && add-apt-repository ppa:deadsnakes/ppa && \
    apt-get update && apt-get install -y python3.6 python3.6-dev python3-pip

RUN ln -sfn /usr/bin/python3.6 /usr/bin/python3 && ln -sfn /usr/bin/python3 /usr/bin/python && ln -sfn /usr/bin/pip3 /usr/bin/pip

RUN pip3 install Pillow==8.0.1

WORKDIR /app
COPY . .

CMD python ./5077.py

5077.py has:

import locale
import sys

from PIL import Image

print(sys.version_info)
print(locale.getpreferredencoding(False))

print(open("xxx_Diözesxxxxx.jpg", encoding="utf-8"))

print(Image.open("xxx_Diözesxxxxx.jpg"))

Outputs:

sys.version_info(major=3, minor=6, micro=9, releaselevel='final', serial=0)
ANSI_X3.4-1968
Traceback (most recent call last):
  File "./5077.py", line 9, in <module>
    print(open("xxx_Di\xf6zesxxxxx.jpg", encoding="utf-8"))
UnicodeEncodeError: 'ascii' codec can't encode character '\xf6' in position 6: ordinal not in range(128)

But adding this stuff to the Dockerfile and rebuilding:

FROM ubuntu:18.04

RUN apt-get update && apt-get install -y software-properties-common && add-apt-repository ppa:deadsnakes/ppa && \
    apt-get update && apt-get install -y python3.6 python3.6-dev python3-pip

RUN ln -sfn /usr/bin/python3.6 /usr/bin/python3 && ln -sfn /usr/bin/python3 /usr/bin/python && ln -sfn /usr/bin/pip3 /usr/bin/pip

RUN pip3 install Pillow==8.0.1

# Stuff added here:
RUN apt-get clean && apt-get update && apt-get install -y locales
RUN locale-gen en_CA.UTF-8
COPY ./default_locale /etc/default/locale
RUN chmod 0755 /etc/default/locale
ENV LC_ALL=en_CA.UTF-8
ENV LANG=en_CA.UTF-8
ENV LANGUAGE=en_CA.UTF-8

WORKDIR /app
COPY . .

CMD python ./5077.py

And adding this to default_locale in the same directory as the Dockerfile:

environment=LANG="es_ES.utf8", LC_ALL="es_ES.UTF-8", LC_LANG="es_ES.UTF-8"

Outputs this:

sys.version_info(major=3, minor=6, micro=9, releaselevel='final', serial=0)
UTF-8
<_io.TextIOWrapper name='xxx_Diözesxxxxx.jpg' mode='r' encoding='utf-8'>
<PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=128x128 at 0x7F2E735E2400>

but why does OpenCV open these files with UTF-8 names (the ones with JPG binary content)?

I haven't looked at OpenCV, but I would guess it bypasses the Python file handling mechanism and has its own written in C.
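
A similar bypass is possible from Python, offered here as a sketch that was not tried in this thread: a bytes path is handed to the operating system as-is, so no locale-dependent encoding step takes place. The helper open_utf8_named is a hypothetical name, and the sketch assumes the name is stored on disk as UTF-8.

```python
import os
import tempfile


def open_utf8_named(path):
    """Open a file for binary reading, encoding its name as UTF-8 explicitly.

    Passing bytes skips Python's locale-derived file name encoding.
    Assumes the name is stored on disk as UTF-8.
    """
    return open(path.encode("utf-8"), "rb")


# Demonstration with a throwaway file whose name contains "ö".
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "ö.jpg")
    with open(path.encode("utf-8"), "wb") as out:
        out.write(b"hello")
    with open_utf8_named(path) as fp:
        data = fp.read()

print(data)  # b'hello'
```

Since Image.open also accepts an already-open binary file object, the result of open_utf8_named could be passed to it directly; the caller is then responsible for closing the file.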

Thank you, hugovk. You added the "locale patch" after the PIL install, whereas I added it before.
But when I tried to reproduce both my configuration and yours, I got an error on entering the Docker container.
I'm not sure whether this error was present before.

john@john-trx40-designare:~/Documents/GitHub/help-me-transcribe$ docker exec -it 42fd5cc5be16 /bin/bash
bash: warning: setlocale: LC_ALL: cannot change locale ("es_ES.UTF-8")

Then when I try to run bug.py

root@42fd5cc5be16:/app# python bug.py
3.6.9 (default, Oct  8 2020, 12:12:24) 
[GCC 8.4.0]
8.0.1
Traceback (most recent call last):
  File "bug.py", line 14, in <module>
    print(Image.open("/ro_user_files/ö.JPG"))
  File "/usr/local/lib/python3.6/dist-packages/PIL/Image.py", line 2891, in open
    fp = builtins.open(filename, "rb")
UnicodeEncodeError: 'ascii' codec can't encode character '\xf6' in position 15: ordinal not in range(128)

Where bug.py contains

import sys
print(sys.version)
import PIL
print(PIL.__version__)
from PIL import Image

import locale

#print(open("ro_user_files/©"))

# print(open("ro_user_files/ö.JPG"))
# locale.getpreferredencoding(False)

print(Image.open("/ro_user_files/ö.JPG"))

I don't get the setlocale warning, which is odd if we're basically using the same Dockerfiles. I'd suggest checking why that's failing. Maybe it's possible to check the list of available locales and pick another?
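
One way to run that check from inside the container is sketched below. It tries each candidate with locale.setlocale and reports the first one the system accepts. The function name first_available_locale and the candidate list are assumptions for illustration.

```python
import locale


def first_available_locale(candidates):
    """Return the first locale name the system accepts for LC_CTYPE, or None."""
    saved = locale.setlocale(locale.LC_CTYPE)  # query the current setting
    try:
        for name in candidates:
            try:
                locale.setlocale(locale.LC_CTYPE, name)
            except locale.Error:
                continue  # not generated or installed on this system
            return name
        return None
    finally:
        # Restore whatever was active before probing.
        locale.setlocale(locale.LC_CTYPE, saved)


print(first_available_locale(["es_ES.UTF-8", "en_US.UTF-8", "C.UTF-8", "C"]))
```

A locale named in an environment variable but never generated with locale-gen would be skipped here, which is the situation the setlocale warning points to.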

Anyway, my output in full:

$ docker build 5077 -t pillow-5077
Sending build context to Docker daemon   12.8kB
Step 1/14 : FROM ubuntu:18.04
 ---> 2c047404e52d
Step 2/14 : RUN apt-get update && apt-get install -y software-properties-common && add-apt-repository ppa:deadsnakes/ppa &&     apt-get update && apt-get install -y python3.6 python3.6-dev python3-pip
 ---> Using cache
 ---> b263597bca42
Step 3/14 : RUN ln -sfn /usr/bin/python3.6 /usr/bin/python3 && ln -sfn /usr/bin/python3 /usr/bin/python && ln -sfn /usr/bin/pip3 /usr/bin/pip
 ---> Using cache
 ---> 26914f1d7760
Step 4/14 : RUN pip3 install Pillow==8.0.1
 ---> Using cache
 ---> a42a845cb87b
Step 5/14 : RUN apt-get clean && apt-get update && apt-get install -y locales
 ---> Using cache
 ---> f4a31a77e24f
Step 6/14 : RUN locale-gen en_CA.UTF-8
 ---> Using cache
 ---> a0f7734801a4
Step 7/14 : COPY ./default_locale /etc/default/locale
 ---> Using cache
 ---> 6177595fb4d8
Step 8/14 : RUN chmod 0755 /etc/default/locale
 ---> Using cache
 ---> fceffc20d1e8
Step 9/14 : ENV LC_ALL=en_CA.UTF-8
 ---> Using cache
 ---> e5cde9482465
Step 10/14 : ENV LANG=en_CA.UTF-8
 ---> Using cache
 ---> 1ed3c91a9d3a
Step 11/14 : ENV LANGUAGE=en_CA.UTF-8
 ---> Using cache
 ---> 215d781eb4bd
Step 12/14 : WORKDIR /app
 ---> Using cache
 ---> 7c838ad85eff
Step 13/14 : COPY . .
 ---> Using cache
 ---> 8cec01a19033
Step 14/14 : CMD python ./5077.py
 ---> Using cache
 ---> 8b6babba353e
Successfully built 8b6babba353e
Successfully tagged pillow-5077:latest

```console
$ docker run --rm pillow-5077
sys.version_info(major=3, minor=6, micro=9, releaselevel='final', serial=0)
UTF-8
<_io.TextIOWrapper name='xxx_Diözesxxxxx.jpg' mode='r' encoding='utf-8'>
```

And I spotted my script should have left out the explicit encoding:
```diff
-print(open("xxx_Diözesxxxxx.jpg", encoding="utf-8"))
+print(open("xxx_Diözesxxxxx.jpg"))
```

But it worked without it:

$ docker run -it --rm pillow-5077 /bin/bash
root@982014f50a7c:/app# python
Python 3.6.9 (default, Oct  8 2020, 12:12:24)
[GCC 8.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import locale
>>> locale.getpreferredencoding(False)
'UTF-8'
>>> open("xxx_Diözesxxxxx.jpg")
<_io.TextIOWrapper name='xxx_Diözesxxxxx.jpg' mode='r' encoding='UTF-8'>
>>> from PIL import Image
>>> Image.open("xxx_Diözesxxxxx.jpg")
<PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=128x128 at 0x7F7472CD6B00>
>>>

Looks like we have a different signature at step 2.
I'm not sure it is reasonable to expect deterministic images when versions are not specified.

Building py_image_service
Step 1/17 : FROM ubuntu:18.04
 ---> 2c047404e52d
Step 2/17 : RUN apt-get update && apt-get install -y software-properties-common && add-apt-repository ppa:deadsnakes/ppa &&     apt-get update && apt-get install -y python3.6 python3.6-dev python3-pip
 ---> Using cache
 ---> 73af97969ba9
Step 3/17 : RUN ln -sfn /usr/bin/python3.6 /usr/bin/python3 && ln -sfn /usr/bin/python3 /usr/bin/python && ln -sfn /usr/bin/pip3 /usr/bin/pip
 ---> Using cache
 ---> 5d3166544fee
Step 4/17 : RUN pip3 install ptvsd
 ---> Using cache
 ---> 8df10837b1f4
Step 5/17 : RUN pip3 install Pillow==8.0.1
 ---> Using cache
 ---> c554e17939dc
Step 6/17 : RUN apt-get clean && apt-get update && apt-get install -y locales
 ---> Using cache
 ---> ea9d050c34aa
Step 7/17 : RUN locale-gen en_CA.UTF-8
 ---> Using cache
 ---> 56b191bc5319
Step 8/17 : COPY ./default_locale /etc/default/locale
 ---> Using cache
 ---> 64d08c16b029
Step 9/17 : RUN chmod 0755 /etc/default/locale
 ---> Using cache
 ---> 57d823e1a288
Step 10/17 : ENV LC_ALL=en_CA.UTF-8
 ---> Using cache
 ---> cccb01c4a9ee
Step 11/17 : ENV LANG=en_CA.UTF-8
 ---> Using cache
 ---> 91bf1805a32e
Step 12/17 : ENV LANGUAGE=en_CA.UTF-8
 ---> Using cache
 ---> f0823e5b6c3f
Step 13/17 : RUN pip install --user numpy scipy matplotlib
 ---> Using cache
 ---> b6f8d0097fd6
Step 14/17 : WORKDIR /app
 ---> Using cache
 ---> ee3bebed939e
Step 15/17 : COPY . .
 ---> Using cache
 ---> 638d8eb59008
Step 16/17 : EXPOSE 55001 5678
 ---> Using cache
 ---> 72c44e36c799
Step 17/17 : CMD python ./py-image-server.py
 ---> Using cache
 ---> b1919d168558
Successfully built b1919d168558
Successfully tagged py_image_service:latest

I have no familiarity with locales, so not sure if this is meaningful or not

john@john-trx40-designare:~/Documents/GitHub/help-me-transcribe$ docker exec -it 42fd5cc5be16 /bin/bash
bash: warning: setlocale: LC_ALL: cannot change locale ("es_ES.UTF-8")
root@42fd5cc5be16:/app# locale -a
locale: Cannot set LC_CTYPE to default locale: No such file or directory
locale: Cannot set LC_MESSAGES to default locale: No such file or directory
locale: Cannot set LC_COLLATE to default locale: No such file or directory
C
C.UTF-8
POSIX
en_CA.utf8

Anyhow, a big-picture question: if a string is encoded as xyz and it represents a path, why is the string not used as encoded? How does the locale come into play?

One more thing, I'm a little suspicious about the
bash: warning: setlocale: LC_ALL: cannot change locale ("es_ES.UTF-8")
on entering the docker image. It's not like me to miss something like this.

I still need to figure out why the workaround does not work for me. My host is a fairly clean default Ubuntu install, since all development is done in Docker. For example, es_ES does not exist on my host.

The workaround feels a little kludgy: it specifies a Spanish locale in default_locale (LANG="es_ES.utf8") and a Canadian locale in the Dockerfile (LC_ALL=en_CA.UTF-8), while the image microservice resides in the USA, all because I have a UTF-8 string for a path to a binary file (a JPG picture) that is not in any language.

What does language have to do with path names?

Yeah, the locale probably doesn't have to be Spanish/Canadian, but it needs to be set to _something_ that makes it use UTF-8 instead of ASCII (reported as ANSI_X3.4-1968).

Thanks a million; en_US.UTF-8 feels better in my case.
Any idea why PIL/Python restricts paths based on language?
Everything about language appears to relate to how time, currency, and dates are presented. What is the connection to the path?
UTF-8 is a superset of ASCII, so why does it take extra work to use the default Python string type as a path?

I haven't tested it, but I think you might be able to just update to Python 3.7 and maybe also add -X utf8 when running your program.
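
The -X utf8 suggestion can be checked without rebuilding an image. The sketch below assumes Python 3.7 or later, where UTF-8 mode was introduced, and simulates a container without a UTF-8 locale by setting LC_ALL=C for a child interpreter. The variable names are illustrative.

```python
import os
import subprocess
import sys

# Child process: report the file name encoding Python settled on.
probe = "import sys; print(sys.getfilesystemencoding())"

# Simulate a container without a UTF-8 locale.
env = dict(os.environ, LC_ALL="C")

result = subprocess.run(
    [sys.executable, "-X", "utf8", "-c", probe],
    env=env,
    stdout=subprocess.PIPE,
    universal_newlines=True,
    check=True,
)
fs_encoding = result.stdout.strip()
print(fs_encoding)  # utf-8
```

Setting the environment variable PYTHONUTF8=1 is the equivalent switch when changing the command line is inconvenient, for example in a Dockerfile ENV line.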

Any idea why PIL/Python restricts paths based on language?

I would recommend reading the blog series by Victor Stinner (a core CPython developer) on the handling of filenames in UTF-8 and other locales in Python. It explains some of the reasons for the various changes made to Python and its handling of paths. This is the last entry, but it links to the previous ones in the series: https://vstinner.github.io/python37-new-utf8-mode.html

Excellent blog. Thank you, nulano, for pointing out Victor's blog. Key takeaway: legacy is messy, and this is an active area.

Changed my Dockerfile to a more current version

FROM python:3.10.0a2-buster
RUN pip3 install Pillow==8.0.1
WORKDIR /app
COPY . .
CMD python bug.py

Where bug.py remains

import sys
print(sys.version)

import PIL
print(PIL.__version__)

from PIL import Image

print(Image.open("ö.jpg"))

and it works without any changes, i.e. no -X in launch, no env PYTHONIOENCODING=utf-8, and no https://stackoverflow.com/a/58780738/724176.

bug_service_1  | 3.10.0a2 (default, Nov 18 2020, 13:05:03) 
bug_service_1  | [GCC 8.3.0]
bug_service_1  | 8.0.1
bug_service_1  | <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=3310x2514 at 0x7FF23EBEF3D0>
bug_bug_service_1 exited with code 0

If you need to stay on ubuntu:18.04, then hugovk's suggestion works fine.

The Dockerfile

FROM ubuntu:18.04


RUN apt-get update && apt-get install -y software-properties-common && add-apt-repository ppa:deadsnakes/ppa && \
    apt-get update && apt-get install -y python3.6 python3.6-dev python3-pip

RUN ln -sfn /usr/bin/python3.6 /usr/bin/python3 && ln -sfn /usr/bin/python3 /usr/bin/python && ln -sfn /usr/bin/pip3 /usr/bin/pip


RUN pip3 install Pillow==8.0.1

#https://stackoverflow.com/questions/5387895/unicodeencodeerror-ascii-codec-cant-encode-character-u-u2013-in-position-3/58780738#58780738
RUN apt-get clean && apt-get update && apt-get install -y locales
RUN locale-gen en_US.UTF-8
COPY ./default_locale /etc/default/locale
RUN chmod 0755 /etc/default/locale
ENV LC_ALL=en_US.UTF-8
ENV LANG=en_US.UTF-8
ENV LANGUAGE=en_US.UTF-8


WORKDIR /app
COPY . .

CMD python bug.py

The default_locale
environment=LANG="en_US.UTF-8", LC_ALL="en_US.UTF-8", LC_LANG="en_US.UTF-8"

Successful output

Attaching to bug_bug_service_1
bug_service_1  | 3.6.9 (default, Oct  8 2020, 12:12:24) 
bug_service_1  | [GCC 8.4.0]
bug_service_1  | 8.0.1
bug_service_1  | <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=3310x2514 at 0x7FE8FF1F2A90>
bug_bug_service_1 exited with code 0