How should we manage encoding across the code base?
Metrics are interpreted as UTF-8
return StringIO(data.decode("utf-8"))
return open(path, encoding="utf-8")
We write modifications to gitignore using UTF-8
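As a rough illustration of the two points above, here is a minimal sketch that decodes metric bytes as UTF-8 and appends an entry to a gitignore file with an explicit encoding. The helper names `read_metric` and `append_ignore_entry` are hypothetical and are not taken from the DVC code base.

```python
import io
import os
import tempfile


def read_metric(data: bytes) -> io.StringIO:
    # Hypothetical helper: metric bytes are always decoded as UTF-8.
    return io.StringIO(data.decode("utf-8"))


def append_ignore_entry(gitignore_path: str, entry: str) -> None:
    # Hypothetical helper: the encoding is passed explicitly so the bytes
    # written do not depend on the system locale. newline="\n" keeps the
    # line ending identical on every platform.
    with open(gitignore_path, "a", encoding="utf-8", newline="\n") as fobj:
        fobj.write(entry + "\n")


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, ".gitignore")
    append_ignore_entry(path, "ramón.txt")
    # Read the file back in binary mode to inspect the exact bytes on disk.
    with open(path, "rb") as fobj:
        raw = fobj.read()

metric_text = read_metric("ramón: 0.5".encode("utf-8")).read()
```

Reading the file back in binary mode makes it easy to confirm that "ó" was stored as the two UTF-8 bytes `0xC3 0xB3`.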
Git is encoding agnostic about what is written to gitignore, example below:
fname_utf8="ramón.txt"
fname_latin1="$(echo 'ramón.txt' | iconv -f 'utf-8' -t 'latin1')"
echo "something" > ${fname_utf8}
echo "something" > ${fname_latin1}
echo "${fname_utf8}" > .gitignore
echo "${fname_latin1}" >> .gitignore
git status --ignored
# "ram\303\263n.txt"
# "ram\363n.txt"
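The two quoted names in that output are the raw bytes of each file name, with non-ASCII bytes shown as octal escapes. The sketch below reproduces them in Python. `git_style_quote` is a hypothetical helper that only approximates git's path quoting (it writes every byte outside printable ASCII as a three-digit octal escape) and is not git's actual implementation.

```python
name = "ramón.txt"

utf8_bytes = name.encode("utf-8")      # "ó" becomes two bytes: 0xC3 0xB3
latin1_bytes = name.encode("latin-1")  # "ó" becomes one byte: 0xF3


def git_style_quote(raw: bytes) -> str:
    # Hypothetical approximation: keep printable ASCII (except the double
    # quote and backslash), write every other byte as \ooo in octal.
    out = []
    for byte in raw:
        if 0x20 <= byte < 0x7F and byte not in (0x22, 0x5C):
            out.append(chr(byte))
        else:
            out.append("\\{:03o}".format(byte))
    return '"' + "".join(out) + '"'


quoted_utf8 = git_style_quote(utf8_bytes)
quoted_latin1 = git_style_quote(latin1_bytes)
```

Both ignore entries match their files because the comparison happens on bytes, without either name being decoded.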
Python 3 uses utf-8 by default when encoding / decoding, not the system's default one.
def encode(self, /, encoding='utf-8', errors='strict')
def decode(self, /, encoding='utf-8', errors='strict')
if not encoding: encoding = locale.getpreferredencoding(False)
Python 2 uses the system's default encoding when calling str.{encode,decode}
open(mode='wb') requires a bytes-like object
Why are we not using the system encoding?
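For reference, the Python 3 points above can be checked with a short sketch. It assumes that the `locale.getpreferredencoding(False)` line quoted above is the fallback used by text-mode `open()` when no encoding is passed. `io.BytesIO` stands in for a file opened with `mode='wb'` so that nothing is written to disk.

```python
import io
import locale

text = "ramón"

# str.encode() and bytes.decode() default to UTF-8, whatever the locale is.
encoded = text.encode()
decoded = encoded.decode()

# Assumption: this is the value text-mode open() falls back to when the
# caller does not pass encoding=..., so it can vary between machines.
locale_default = locale.getpreferredencoding(False)

# Binary streams accept only bytes-like objects; writing a str fails.
buf = io.BytesIO()
try:
    buf.write(text)
    wrote_str = True
except TypeError:
    wrote_str = False

# Encoding explicitly before writing works.
buf.write(text.encode("utf-8"))
```

Passing `encoding="utf-8"` explicitly to `open()` removes the dependency on `locale_default`.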
I would expect git to use UTF-8 to interpret the file.
I think we can keep the same policy for DVC files (as we do now): always use utf-8. We should definitely reflect this in the docs, specifically this section - https://dvc.org/doc/user-guide/dvc-file-format
Metrics are interpreted as UTF-8
This is a good question. The only way I see to solve this is to provide an encoding parameter to the metrics show command. I would use utf-8 as the default for simplicity and would specify this in the documentation.
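One way the suggestion above could look, as a sketch only: the `--encoding` flag, the `metrics-show-sketch` program name and the `read_metric` helper are all hypothetical and do not describe an existing DVC option or function.

```python
import argparse


def read_metric(raw: bytes, encoding: str = "utf-8") -> str:
    # Hypothetical helper: decode metric bytes with the requested encoding,
    # falling back to UTF-8 when none is given.
    return raw.decode(encoding)


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical CLI: a single optional flag with UTF-8 as its default.
    parser = argparse.ArgumentParser(prog="metrics-show-sketch")
    parser.add_argument(
        "--encoding",
        default="utf-8",
        help="encoding used to read the metric file (default: utf-8)",
    )
    return parser


# No flag given: the UTF-8 default applies.
args = build_parser().parse_args([])
default_text = read_metric("ramón".encode("utf-8"), args.encoding)

# Explicit flag: a Latin-1 encoded metric is read correctly.
args = build_parser().parse_args(["--encoding", "latin-1"])
latin1_text = read_metric(b"ram\xf3n", args.encoding)
```

Keeping utf-8 as the default means existing metric files behave as before, and only users with files in another encoding need the extra flag.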
Py2 is dead, so I guess we can close this, as utf-8 is used by default everywhere?
@efiop , let's close it :)