Cabal: Lack of support for non-ASCII characters in cabal sdist

Created on 2 Sep 2016  路  20Comments  路  Source: haskell/cabal

This bug is really based in the tar package, and affects both Stack (https://github.com/commercialhaskell/stack/issues/2549 and https://github.com/commercialhaskell/stack/pull/2557) and cabal-install. Repro steps for cabal-install:

  1. git clone https://github.com/snoyberg/yaml
  2. cd yaml
  3. git checkout yaml/0.8.18.2
  4. cabal sdist
  5. tar xfv dist/yaml-0.8.18.2.tar.gz

Expected: Find a file with the name yaml-0.8.18.2/test/resources/accent茅/foo.yaml

Actual: The file is named yaml-0.8.18.2/test/resources/accente^A/foo.yaml

Note that Hackage itself rejects non-ASCII filenames, so this is _almost_ a moot point. However, if cabal sdist properly encodes the file path, Hackage will reject the upload with a useful error message, instead of silently accepting the upload (which happened to me). There may also be an argument to be made that Hackage's filepath checking code should be made a bit more strict I'm not sure.

upstream other

Most helpful comment

Perhaps it would be wise to additionally tighten up the non-ASCII character detection in tar so that Hackage rejects packages that are uploaded using today's cabal sdist/stack sdist. In particular, yaml version 0.8.18.4 has a corrupted filepath which ideally would have been rejected by Hackage.

All 20 comments

To be clear, this only applies to cabal sdist, and not cabal act-as-setup -- sdist, because Cabal's code calls the tar program directly which does not have this bug.

See upstream https://github.com/haskell/tar/issues/6

Thank you for the comment, you just unbroke my brain. I couldn't reconcile what I saw in the Cabal source code with the behavior I was experiencing.

It certainly puzzled me a while too.

So this is basically a dupe of https://github.com/haskell/tar/issues/6?

Yes, but it can be worked around inside cabal-install the same way I proposed doing it in Stack, see https://github.com/commercialhaskell/stack/pull/2557/files#diff-7f7c1fa6ec0ee064e1a11c95f5b1b025L110.

I was about to try my hand at a patch for this, but I'm not sure which library to use for the UTF8 encoding. Neither text nor utf8-string appear to be dependencies of cabal-install right now, and I doubt they are desired additions. I suppose an alternative is simply to check the paths to ensure they are ASCII and error out if not.

@snoyberg cabal-install does depend on text indirectly (via parsec), so using text is fine.

@snoyberg

https://github.com/commercialhaskell/stack/pull/2557/files#diff-7f7c1fa6ec0ee064e1a11c95f5b1b025L110

Something like this workaround would be acceptable for now, but in the long run we should fix it at the source.

I was just looking into "fixing it at the source" as well. It seems unclear how to do it. One idea I had is:

  • Introduce a bytesToTarPath function, which would be defined as bytesToTarPath isDir path = toTarPath isDir (BS.unpack path)
  • Introduce a toTarPathUtf8 function, which would be defined as toTarPathUtf8 isDir path = bytesToTarPath isDir (encodeUtf8 (T.pack path))
  • Deprecate toTarPath

I can float this idea over there too. But I'm not too excited about it, since it's just moving the exact same approach I'm recommending here and in Stack to the tar package, and there doesn't seem to be anything better without breaking backwards compatibility.

Actually, this is harder to resolve in cabal-install than it was in Stack, due to usage of the Tar.pack function. So in this case, it probably _does_ make more sense to move the fix upstream.

@snoyberg Since our aim is to make tarballs with unicode fail better, I was thinking about making the tar package fail directly on encountering paths with unicode.

IMO handling that case with UTF8 encoding makes more sense, but best behavior is certainly ambiguous.

So what do people recommend here? Should we generally switch cabal and hackage over to allowing UTF8 encoded file names? Or just try and fail in a clearer way?

Back in the day it wan't really possible to rely on file names being UTF8 on unix, but these days we probably can. The other reason we've historically been conservative on hackage is that we want to try and keep maximum compatibility for tools on all platforms and preserving stuff for long time periods, so sticking closely to Tar posix standards seems wise here.

And yes, I'd of course like whatever we decide to go upstream (can of course keep temporary workarounds).

Should we generally switch cabal and hackage over to allowing UTF8 encoded file names? Or just try and fail in a clearer way?

Let's implement the latter for now, which I think is much easier than the former.

I'm fully in favor of just failing more clearly. If not too complicated, I'd also recommend adding support now for UTF8 aware unpack in case the situation changes in the future, but that's not high priority IMO.

Ok, thanks for feedback folks. I concur, lets fail better now, and I'll put it on my TODO to support UTF8 filenames in the tar package, but won't treat it as an "omg! emergency!" level of urgency.

Perhaps it would be wise to additionally tighten up the non-ASCII character detection in tar so that Hackage rejects packages that are uploaded using today's cabal sdist/stack sdist. In particular, yaml version 0.8.18.4 has a corrupted filepath which ideally would have been rejected by Hackage.

True could add an ASCII check to the optional portability checks in the tar lib.

Moving this to 1.24.0.2.

Remilestoned.

Was this page helpful?
0 / 5 - 0 ratings