Postgres: Locale utf8 doesn't conform to standard

Created on 16 Oct 2018  路  14Comments  路  Source: docker-library/postgres

Why is the locale here for UTF-8 set as utf8? This doesn't seem to conform with how other platforms are specifying the locale and is causing issues migrating between this postgres platform and others.

question

All 14 comments

Examples and/or documentation links would help us understand the case
you're making. 馃槈

The postgres settings we're trying to match are for AWS's postgres instances, and the default postgres settings on mac. The settings in particular are the following

lc_collate                             | en_US.UTF-8    
lc_ctype                               | en_US.UTF-8

These are in contrast to the ones in this dockerfile which are

lc_collate                             | en_US.utf8  
lc_ctype                               | en_US.utf8

I referred initially to this stack overflow post https://stackoverflow.com/questions/809620/utf8-or-utf-8 and was referencing https://www.iana.org/assignments/character-sets/character-sets.xml.

Having this locale set to en_US.UTF-8 will allow easy backups between docker based postgres databases and our other local and cloud db's as well.

Looking at https://www.postgresql.org/docs/10/static/multibyte.html#MULTIBYTE-CHARSET-SUPPORTED however, it would appear that UTF8 is correct in the PostgreSQL context.

That's true. Looking at their locale settings though https://www.postgresql.org/docs/current/static/locale.html postgres initializes with the locale of the current environment. Seeing as environments outside of postgres use the standard character set would it make sense for the env var set in the dockerfile to match that?

I'm confused. What problem would it solve to switch from ENV LANG en_US.utf8 to ENV LANG en_US.UTF-8?

Linux glibc will internally normalize the encoding name, by converting it to lowercase & removing most special characters, so both variants will work

https://superuser.com/a/999151

I also think it would break since we send localedef the uppercase version, en_US.UTF-8, and it generates en_US.utf8:

$ docker run -it --rm debian:stretch
root@73eb5b25f453:/# locale
LANG=
LANGUAGE=
LC_CTYPE="POSIX"
LC_NUMERIC="POSIX"
LC_TIME="POSIX"
LC_COLLATE="POSIX"
LC_MONETARY="POSIX"
LC_MESSAGES="POSIX"
LC_PAPER="POSIX"
LC_NAME="POSIX"
LC_ADDRESS="POSIX"
LC_TELEPHONE="POSIX"
LC_MEASUREMENT="POSIX"
LC_IDENTIFICATION="POSIX"
LC_ALL=
root@73eb5b25f453:/# env | grep LANG
root@73eb5b25f453:/# locale -a
C
C.UTF-8
POSIX
root@73eb5b25f453:/# apt-get update; apt-get install -y locales; rm -rf /var/lib/apt/lists/*;
Get:1 http://security.debian.org/debian-security stretch/updates InRelease [94.3 kB]
Ign:2 http://cdn-fastly.deb.debian.org/debian stretch InRelease
Get:3 http://cdn-fastly.deb.debian.org/debian stretch-updates InRelease [91.0 kB]
Get:4 http://security.debian.org/debian-security stretch/updates/main amd64 Packages [450 kB]
Get:6 http://cdn-fastly.deb.debian.org/debian stretch-updates/main amd64 Packages [5148 B]
Get:5 http://cdn-fastly.deb.debian.org/debian stretch Release [118 kB]        
Get:7 http://cdn-fastly.deb.debian.org/debian stretch Release.gpg [2434 B]
Get:8 http://cdn-fastly.deb.debian.org/debian stretch/main amd64 Packages [7099 kB]
Fetched 7859 kB in 2s (3043 kB/s)   
Reading package lists... Done
Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following additional packages will be installed:
  libc-l10n
The following NEW packages will be installed:
  libc-l10n locales
0 upgraded, 2 newly installed, 0 to remove and 0 not upgraded.
Need to get 4107 kB of archives.
After this operation, 13.8 MB of additional disk space will be used.
Get:1 http://cdn-fastly.deb.debian.org/debian stretch/main amd64 libc-l10n all 2.24-11+deb9u3 [820 kB]
Get:2 http://cdn-fastly.deb.debian.org/debian stretch/main amd64 locales all 2.24-11+deb9u3 [3287 kB]
Fetched 4107 kB in 0s (5487 kB/s)
debconf: delaying package configuration, since apt-utils is not installed
Selecting previously unselected package libc-l10n.
(Reading database ... 6498 files and directories currently installed.)
Preparing to unpack .../libc-l10n_2.24-11+deb9u3_all.deb ...
Unpacking libc-l10n (2.24-11+deb9u3) ...
Selecting previously unselected package locales.
Preparing to unpack .../locales_2.24-11+deb9u3_all.deb ...
Unpacking locales (2.24-11+deb9u3) ...
Setting up libc-l10n (2.24-11+deb9u3) ...
Setting up locales (2.24-11+deb9u3) ...
debconf: unable to initialize frontend: Dialog
debconf: (No usable dialog-like program is installed, so the dialog based frontend cannot be used. at /usr/share/perl5/Debconf/FrontEnd/Dialog.pm line 76.)
debconf: falling back to frontend: Readline
debconf: unable to initialize frontend: Readline
debconf: (Can't locate Term/ReadLine.pm in @INC (you may need to install the Term::ReadLine module) (@INC contains: /etc/perl /usr/local/lib/x86_64-linux-gnu/perl/5.24.1 /usr/local/share/perl/5.24.1 /usr/lib/x86_64-linux-gnu/perl5/5.24 /usr/share/perl5 /usr/lib/x86_64-linux-gnu/perl/5.24 /usr/share/perl/5.24 /usr/local/lib/site_perl /usr/lib/x86_64-linux-gnu/perl-base .) at /usr/share/perl5/Debconf/FrontEnd/Readline.pm line 7.)
debconf: falling back to frontend: Teletype
Generating locales (this might take a while)...
Generation complete.
root@73eb5b25f453:/# localedef -i en_US -c -f UTF-8 -A /usr/share/locale/locale.alias en_US.UTF-8
root@73eb5b25f453:/# locale -a
C
C.UTF-8
POSIX
en_US.utf8
root@73eb5b25f453:/# 

For our purposes we don't run postgres on only linux systems where UTF-8 is normalized, and are running into postgres systems where LC_COLLATE and LC_CTYPE are set to en_US.UTF-8.

Our current issue is that it isn't straightforward to dump from our databases on docker without going through and altering the dump file to migrate data to any other database. Example databases include local postgres instances on Ubuntu, OSX, AWS RDS.

If en_US.UTF-8 would break locales then wouldn't C.UTF-8 have broken locales as well? So if both utf8 and UTF-8 work wouldn't the standardized format be preferred?

Ok, let's put this another way. Assuming PostgreSQL still works fine if we change this value to match these other systems, won't we then cause issues for existing users? (The same issues you're seeing with the current value, but in the reverse?)

It'll probably work reasonably well for you to simply modify LANG in your docker run line, stack.yml, etc for instances where you need this to match AWS -- I imagine glibc will likely auto-normalize it and everything will Just Work _and_ give you the values you're expecting without breaking any existing users.

This solution works for us, but I am of the opinion that in the long term switching that would be better for people in the long run even if it causes short term migration pains. I think it's in everyone's best interest if all their postgres instances shared the same format, given there already is a standard dictated.

Frankly I would love is selecting anything other than en_US was simpler - I have few instances with few small-ish databases in various collations, eg. en_GB.UTF-8 or sv_SE.UTF-8, and forcing the OS inside this docker image to generate, set, and use only en_US is troublesome.

Merely respecting LANG passed from the outside (with the en-US.UTF-8 being the default, fine), would be so much better than current hardcoded localedef -i en_US -c -f UTF-8 -A /usr/share/locale/locale.alias en_US.UTF-8 then LANG en_US.utf8

I know I can clone the repo and build my own image, yes.

TA.

Each locale added with default support in the image would increase the size of the image for everyone, so we stuck to a single utf-8 locale and document a simple way for adding more:

FROM postgres:11
RUN localedef -i de_DE -c -f UTF-8 -A /usr/share/locale/locale.alias de_DE.UTF-8
ENV LANG de_DE.utf8

Sure, given the constrains it's sane.

Cheers!

One reason to not change the current value: we will break users that want to upgrade from one version of postgres to another. (https://www.postgresql.org/message-id/1297357018.2340.10.camel%40schrader-laptop)

To get the locale to match your other environment , just set -e on your docker run (or orchestration yml equivalent):

$ docker run -d --name pg -e LANG=en_US.UTF-8 postgres:10
799059ec03eca42527bd57504bfd8b1d6a3b476ae2d6150427d69bd33fa2131c
$ docker exec -it pg psql -U postgres
psql (10.7 (Debian 10.7-1.pgdg90+1))
Type "help" for help.

postgres=# show lc_collate;
 lc_collate  
-------------
 en_US.UTF-8
(1 row)

(other locales will need to run a locale-def https://github.com/docker-library/postgres/issues/510#issuecomment-451290525)

@yosifkit
Even if I specify exact LANG during container creation behavior is still different comparing to AWS RDS. Is this bug on AWS side?

$ docker run -d --name collate -p 4444:5432 -e LANG=en_US.UTF-8 postgres:11-alpine
98d73f681f08cdbd6c23a43d1fdfa9ab24db299c99214430b112e018f7218e10

$ psql -p 4444 postgres postgres
psql (11.4, server 11.6)
postgres@postgres =# \l
                                  List of databases
   Name    |  Owner   | Encoding |   Collate   |    Ctype    |   Access privileges
-----------+----------+----------+-------------+-------------+-----------------------
 postgres  | postgres | UTF8     | en_US.UTF-8 | en_US.UTF-8 |
 template0 | postgres | UTF8     | en_US.UTF-8 | en_US.UTF-8 | =c/postgres          +
           |          |          |             |             | postgres=CTc/postgres
 template1 | postgres | UTF8     | en_US.UTF-8 | en_US.UTF-8 | =c/postgres          +
           |          |          |             |             | postgres=CTc/postgres
(3 rows)

postgres@postgres =# show LC_COLLATE;
 lc_collate
-------------
 en_US.UTF-8
(1 row)

postgres@postgres =# show LC_CTYPE;
  lc_ctype
-------------
 en_US.UTF-8
(1 row)

postgres@postgres =# select 'a' > '_', 'a_' > '_a';
 ?column? | ?column?
----------+----------
 t        | t
(1 row)

postgres@postgres =# select 'a' > '_', 'a_' > '_b';
 ?column? | ?column?
----------+----------
 t        | t
(1 row)

AWS RDS

administrator@postgres => \l postgres
                                  List of databases
   Name   |     Owner     | Encoding |   Collate   |    Ctype    | Access privileges
----------+---------------+----------+-------------+-------------+-------------------
 postgres | administrator | UTF8     | en_US.UTF-8 | en_US.UTF-8 |
(1 row)

administrator@postgres => show LC_COLLATE;
 lc_collate
-------------
 en_US.UTF-8
(1 row)

administrator@postgres => show LC_CTYPE;
  lc_ctype
-------------
 en_US.UTF-8
(1 row)

administrator@postgres => select 'a' > '_', 'a_' > '_a';
 ?column? | ?column?
----------+----------
 t        | t
(1 row)


administrator@postgres => select 'a' > '_', 'a_' > '_b';
 ?column? | ?column?
----------+----------
 t        | f
(1 row)
Was this page helpful?
0 / 5 - 0 ratings