dvc: encoding across the code base

Created on 2 May 2019  ·  4 Comments  ·  Source: iterative/dvc

How should we manage encoding across the code base?

  • At its core, DVC is encoding-agnostic: data files are uninterpreted sequences of bytes
  • Stage files are saved and interpreted as UTF-8

    • Should DVC raise an error when the user changes the encoding of the file?

    • Should we reflect this in our docs?

  • Metrics are interpreted as UTF-8

    • Why are we not using the system default encoding?
    • GitTree reads the file before returning
    • return StringIO(data.decode("utf-8"))
    • WorkingTree just returns a file descriptor
    • return open(path, encoding="utf-8")
    • If we are showing a message to the users, where should we _catch_ the resulting UnicodeDecodeError?
  • We write modifications to gitignore using UTF-8

    • Why are we not using the system encoding?
    • git is encoding agnostic while writing to gitignore, example below:
fname_utf8="ramón.txt"
fname_latin1="$(echo 'ramón.txt' | iconv -f 'utf-8' -t 'latin1')"
echo "something" > "${fname_utf8}"
echo "something" > "${fname_latin1}"
echo "${fname_utf8}" > .gitignore
echo "${fname_latin1}" >> .gitignore

git status --ignored

# "ram\303\263n.txt"
# "ram\363n.txt"
  • Python 3 uses utf-8 by default when encoding / decoding, not the system's default one.

    • str: def encode(self, /, encoding='utf-8', errors='strict')
    • bytes: def decode(self, /, encoding='utf-8', errors='strict')
    • open: if not encoding: encoding = locale.getpreferredencoding(False)
  • Python 2 uses the default encoding reported by sys.getdefaultencoding() (ASCII unless changed) when calling str.{encode,decode}

    • open(mode='wb') requires a bytes-like object
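
One consequence of the GitTree / WorkingTree difference noted in the list is that a bad encoding would surface at different points: the eager `data.decode("utf-8")` fails as soon as the file is opened, while `open(path, encoding="utf-8")` only fails once something reads from it. A minimal sketch of that, assuming nothing about DVC's real classes (`open_eager`, `open_lazy` and `failure_point` are hypothetical names that only mimic the two return statements quoted above):

```python
import io
import os
import tempfile

# Latin-1 bytes for "ramón"; the 0xF3 byte is not valid UTF-8 here.
LATIN1_BYTES = "ram\u00f3n".encode("latin-1")


def open_eager(data):
    """Mimics the GitTree pattern: decode everything before returning."""
    return io.StringIO(data.decode("utf-8"))


def open_lazy(path):
    """Mimics the WorkingTree pattern: decoding is deferred until read()."""
    return open(path, encoding="utf-8")


def failure_point(data):
    """Return (eager, lazy): the stage where UnicodeDecodeError surfaces."""
    try:
        open_eager(data)
        eager = "none"
    except UnicodeDecodeError:
        eager = "open"

    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "metric.txt")
        with open(path, "wb") as raw:
            raw.write(data)

        lazy = "none"
        try:
            fobj = open_lazy(path)
        except UnicodeDecodeError:
            lazy = "open"
        else:
            with fobj:
                try:
                    fobj.read()
                except UnicodeDecodeError:
                    lazy = "read"
    return eager, lazy


print(failure_point(LATIN1_BYTES))
```

If both code paths are kept as they are, a single `try/except` around the open call would not be enough for the working-tree case; the handler would have to wrap the read as well.
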
awaiting response question

All 4 comments

Why are we not using the system encoding?

I would expect git to use UTF-8 to interpret the file.

I think we can keep the same policy for DVC files (as we do now): always use UTF-8. We should definitely reflect this in the docs, specifically this section: https://dvc.org/doc/user-guide/dvc-file-format
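
For reference, the two escaped names printed by `git status --ignored` in the issue's shell example are just the raw bytes of the same file name in two encodings. A small sketch that reproduces them, assuming a simplified version of git's path quoting (`git_quote` is a hypothetical helper that only octal-escapes bytes >= 0x80 and ignores git's other escaping rules):

```python
def git_quote(raw):
    """Approximate git's default path quoting: octal-escape bytes >= 0x80."""
    body = "".join(
        chr(byte) if byte < 0x80 else "\\{:03o}".format(byte) for byte in raw
    )
    return '"{}"'.format(body)


name = "ram\u00f3n.txt"

# UTF-8 stores the accented letter as two bytes (0xC3 0xB3).
print(git_quote(name.encode("utf-8")))    # "ram\303\263n.txt"

# Latin-1 stores it as a single byte (0xF3).
print(git_quote(name.encode("latin-1")))  # "ram\363n.txt"
```

So a .gitignore written by DVC as UTF-8 would only match the first of those two files.
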

Metrics are interpreted as UTF-8

This is a good question. The only way I see to solve this is to provide an encoding parameter to the metrics show command. I would use utf-8 as the default for simplicity and would specify this in the documentation.
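
A rough sketch of what such a parameter could look like on the reading side, with UTF-8 as the default and the decode failure turned into a user-facing message. Everything here is an assumption for illustration: `read_metric` and `MetricEncodingError` are hypothetical names, not DVC's API, and no such option is claimed to exist on the actual command.

```python
import os
import tempfile


class MetricEncodingError(Exception):
    """Hypothetical user-facing error; not part of DVC."""


def read_metric(path, encoding="utf-8"):
    """Read a metric file as text, decoding with `encoding` (UTF-8 by default)."""
    try:
        with open(path, encoding=encoding) as fobj:
            return fobj.read()
    except UnicodeDecodeError as exc:
        # Catch close to the read so the message can name the file and encoding.
        raise MetricEncodingError(
            "unable to decode '{}' as {}; try another encoding".format(path, encoding)
        ) from exc


def demo():
    """Write a Latin-1 metric file, then read it with and without the override."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "metric.txt")
        with open(path, "wb") as raw:
            raw.write("pr\u00e9cision: 0.9".encode("latin-1"))

        try:
            read_metric(path)
            default_ok = True
        except MetricEncodingError:
            default_ok = False

        return default_ok, read_metric(path, encoding="latin-1")


print(demo())
```

With the default the read fails cleanly, and passing `encoding="latin-1"` recovers the text, which is the behaviour the documentation would need to describe.
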

Py2 is dead, so I guess we can close this, as utf-8 is used by default everywhere?

@efiop, let's close it :)
