Git: diff filter for utf-16 encoded files

Created on 23 May 2018 · 22 Comments · Source: git-for-windows/git

  • [x] I was not able to find an open or closed issue matching what I'm seeing

Why not include a diff filter for utf-16 encoded files? On every setup I modify git so I can view diffs of these files from Visual Studio native Windows apps: app.rc and resources.h files. There is no need to include huge iconv since Windows can convert utf-16 to utf-8 easily, and I have built a small (9 KB) tool that converts these files on the fly (only if needed, because they can be ASCII in older Visual Studio versions).

up for grabs

All 22 comments

Looks interesting. Do check on the main Git mailing list about their choices for the spelling of "utf16" etc. I believe there are some codes to indicate Byte Order Marks etc. It's probably worth ensuring they are in line with each other. Somewhere around here (You may already have it all correct ;-)

I think that Visual Studio uses only little-endian; however, it's not really hard to add a few lines to the conversion code. Again, it's not for every utf-xx file, only for resource files in Windows native projects in Visual Studio.

FYI, Git v2.18 will contain the new attribute working-tree-encoding that deals exactly with different encodings of text files.

@bbolli no, it's not. These files should be in utf-16. Only a diff filter is an acceptable solution.

Quoting the relevant commit:

convert: add 'working-tree-encoding' attribute

Git recognizes files encoded with ASCII or one of its supersets (e.g.
UTF-8 or ISO-8859-1) as text files. All other encodings are usually
interpreted as binary and consequently built-in Git text processing
tools (e.g. 'git diff') as well as most Git web front ends do not
visualize the content.

Add an attribute to tell Git what encoding the user has defined for a
given file. If the content is added to the index, then Git reencodes
the content to a canonical UTF-8 representation. On checkout Git will
reverse this operation.

So you define your files as UTF-16 and Git will convert to this on checkout and back to UTF-8 on git add for storage in the index and blobs. What do you think is missing?
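As a rough sketch of what this attribute looks like in practice, assuming the resource files carry a byte order mark and match the patterns below (the patterns are illustrative and not taken from the thread):

```
# .gitattributes (committed with the repository)
*.rc        text working-tree-encoding=UTF-16 eol=CRLF
resources.h text working-tree-encoding=UTF-16 eol=CRLF
```

With these lines Git should store UTF-8 in the repository and write UTF-16 to the working tree. Files without a BOM would need `UTF-16LE` or `UTF-16BE` instead of `UTF-16`.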

Is there a 100% guarantee that these files will be in utf-16 at checkin? Should everyone update git to the latest version and set some options in local git config? There is only a problem with viewing diffs of utf-16 files and it can be done without conversion at every checkin/checkout. There is no advantage of keeping these files in utf-8 internally in git.

What I mean to say is: I think it's not useful to add a feature to Git for Windows now that will be added natively in a few weeks.

It's a different thing. It's not the auto line endings feature. If for some reason these files are not converted back to utf-16, they will not work.

I'm not talking about end-of-line.

I believe there is a little bit of being "at cross purposes" in the discussion here.

If I understand the points correctly,

@crea7or (Pavel) is referring to a few specific files in the worktree that are generated by VS and that Git, and diff, would normally detect as being binary. Meanwhile it is those same files, in that format, that he wants to be diff'ed. These files may not even be in the index/repository. Pavel's tool is there for the diff program to use for on-the-fly conversion to utf-8 that diff will work with (from a local file). Meanwhile VS still has the original, unchanged, utf-16 file it needs to carry on in blissful ignorance. I'm guessing his (unstated!) setup is to set a --textconv via "Performing text diffs of binary files" in gitattributes (note the line "diffs generated by textconv are not suitable for applying.")
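For readers unfamiliar with that mechanism, a minimal sketch of a textconv setup follows. The driver name `utf16` and the file patterns are assumptions for illustration, and iconv stands in for Pavel's own converter, whose command line is not shown in this thread:

```
# .gitattributes (committed with the repository)
*.rc        diff=utf16
resources.h diff=utf16

# .git/config or ~/.gitconfig (local to each user, not shared by the repository)
[diff "utf16"]
    textconv = iconv -f UTF-16 -t UTF-8
```

Only the attribute lines travel with the repository. Each user still has to define the driver in their own config, and the stored blobs stay UTF-16.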

@bbolli (Beat) is referring to the issue of adding/storing/committing those same files into the index/repository, where there are new options for a reversible(?) 'working-tree-encoding' conversion that uses utf8 as the canonical storage format (just like one may use LF endings as the canonical EOL on Windows). This capability is the other side of the coin, so that others (say those on *nix machines) could 'read' those files from the repo, without the 'working-tree-encoding' set, and simply see the utf8 representation, but clearly that would be useless for VS (needing utf-16), but then who'd run VS on *nix anyway...

Would that be about right? Hopefully the wordiness helps tease out the differences between the two "same but different" goals.

If @crea7or could confirm how their Git attributes/config settings are set up, that may clarify the issue (maybe add it to the readme of the reference tool repo as well).

@PhilipOakley you have it right. My main beef with Pavel's method is that we shouldn't add a feature to Git for Windows that will be implemented (and incompatibly, configuration-wise) by upstream Git in the next release (and we all know how fast Dscho releases his GfW after the upstream release!). Whether there's a conversion to UTF-8 for storage etc. is an implementation detail IMO. The effect of both methods is the same: diffs don't just say "binary files differ", but the user can see the textual delta.

There is no need to include huge iconv

Too late ;-)

$ which iconv
/usr/bin/iconv

@dscho wow, now it's even easier!

@PhilipOakley since iconv is already in the repo, only a script (slightly modified for iconv) and a short note for users are required; that would be enough. I hope iconv will not convert non-utf-16 resource files.
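One way to address that last concern is to check for a byte order mark before calling iconv. The following is only a sketch under stated assumptions: the function name `utf16_textconv` is hypothetical, and it treats any file without a UTF-16 BOM as already readable text, which may not hold for every resource file.

```shell
#!/bin/sh
# utf16_textconv (hypothetical helper name): print FILE as UTF-8 if it starts
# with a UTF-16 byte order mark, otherwise print it unchanged.
utf16_textconv() {
    # First two bytes as hex: "fffe" is a little-endian BOM, "feff" big-endian.
    bom=$(head -c 2 "$1" | od -An -tx1 | tr -d ' \n')
    case "$bom" in
        # With a BOM present, "UTF-16" lets iconv pick the byte order itself.
        fffe|feff) iconv -f UTF-16 -t UTF-8 "$1" ;;
        # No BOM: assume ASCII/UTF-8 (as older Visual Studio versions wrote).
        *) cat "$1" ;;
    esac
}
```

To use it as a diff filter, the function could be saved in a script that ends with `utf16_textconv "$1"`, with the `textconv` setting of a diff driver pointing at that script, since Git passes the file path as the only argument.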

since iconv is already in the repo, only a script (slightly modified for iconv) and a short note for users are required; that would be enough. I hope iconv will not convert non-utf-16 resource files.

@crea7or this sounds like a perfect opportunity for you to get involved.

@dscho ok, I'm ready. So, to be clear: I'll test iconv with Windows resource files and decide what to do next with the conversion, then prepare the script and docs for a PR?

@crea7or sounds good to me!

Thanks for your great summary @PhilipOakley !

@crea7or: Author of working-tree-encoding here. I implemented the feature for Visual Studio users at @Autodesk and I would like to learn more about the problems you see.

Is there a 100% guarantee that these files will be in utf-16 at checkin?

It should be consistent for certain VS file types. However, if they are not 100% UTF-16 at checkin ... wouldn't your textconv filter trip over this too?

Should everyone update git to the latest version

Unfortunately yes 😢. I agree that this is a major downside of working-tree-encoding. However, in companies you can sometimes control the Git version and therefore it might not be that big of a problem. Only time can solve that, I guess. Another problem is libgit2-based clients. I will work on libgit2 support for working-tree-encoding next.

and set some options in local git config?

A config change is not necessary. The conversion is defined in .gitattributes and that is stored in the Git repo.

There is only a problem with viewing diffs of utf-16 files and it can be done without conversion at every checkin/checkout.

Unless you want to look at the content with web based tools like GitHub, BitBucket, GitLab...

There is no advantage of keeping these files in utf-8 internally in git.

You could consider it the "canonical form". That makes all kinds of text transformations (e.g. line ending conversions, but even search/replace in some cases) easier.

@larsxschneider I hadn't thought about web-based viewers, and a canonical form is a very good thing for the overall solidity of the tool. I will take a look into it.

@crea7or can we close this now? Or maybe you could write up your learnings about working-tree-encoding in a wiki page, to help the next users struggling with the very same problem as you are, and then we close it?

maybe you could write up your learnings about working-tree-encoding in a wiki page, to help the next users struggling with the very same problem as you are

I marked this ticket "Up for grabs" with this in mind.

@dscho yes, we can close it. I'll try to describe both methods in wiki later.
