Gitea: Seeing regular files with special characters in it

Created on 23 Oct 2020  Â·  17Comments  Â·  Source: go-gitea/gitea

  • Gitea version (or commit ref): 1.14.0

Description

I cannot see a file if it contains special chars. Is there any way to enforce watching the file anyway?
See this repo for instance.

The file contains this

# frozen_string_literal: true

puts 'Foobar!'

And it makes sense that gitea thinks, that this could be a binary file because of the . But it obviously isn't a binary file and I would love to see the file anyway. Instead of having to download it.

Funnily enough, the commit view shows the content anyway. So something is inconsistent here.

Screenshots

Screenshot from 2020-10-23 12-55-05

All 17 comments

What character does the filename contain?

The file contains this

# frozen_string_literal: true

puts 'Foobar!'

and the character causing this is "".

You can also inspect / checkout the example repo if it helps.

That's the unicode replacement character which essentially should be considered as text, not binary. How does GitHub handle this case?

That's the unicode replacement character which essntially should be considered as text, not binary.

I pasted the exact character. GitHub is only _showing_ the Unicode replacement character (which is IMHO also the best way of handling this). You can simply try it out by yourself:

  1. Go to the Gitea commit view
  2. Copy the text between the '
  3. Paste it here in a comment field on GitHub
  4. See how it is rendered by switching to the Preview tab

GitHub's rendering of unrenderable characters

https://github.com/zeripath/pathological/blob/be-broken/regular_text_file.rb is how it appears on github.com

So similar to the commit view of Gitea (when you're copying the test, you're getting the 'correct' character as well).
Thus, if the state of Gitea should be on par with the view of GitHub, the 'regular Gitea view' must be fixed, right?

I this case it is because http.DetectContentType returns application/octet-stream for this example:

https://github.com/go-gitea/gitea/blob/bfc553164aafaacf5a82427b85145aa2ef34aaa3/modules/base/tool.go#L357

https://golang.org/pkg/net/http/#DetectContentType

Thats also what my system returns locally when checking out the repo...maybe we need a different test for text then though this file is telling everybody else "im not a text file" so ...

Sounds lika a golang bug if � is detected as binary.

https://mimesniff.spec.whatwg.org/#identifying-a-resource-with-an-unknown-mime-type

It's an entirely different issue and yet related:

It the _comments_ Gitea doesn't render the character like GitHub does (�).
Should it it stay this way or adapted to GitHub's logic (or to put it differently: should I open an issue for that)?

Comments seem OK to me (I left an example that looks fine) and I suspect that is just another copy/paste issue or something.

To focus on this issue: It isn't just golang that detects mime type like this, the file command seems to also. So if there is a bug its in the general spec of mime type signatures maybe.

But I bet Github just doesn't use mime type detection for text files for these reasons.

We could instead maybe do something like this:

go // IsTextFile returns true if file content format is plain text or empty. func IsTextFile(data []byte) bool { if len(data) == 0 { return true } return utf8.Valid(data) }
Which seems to work in a few simple tests including this example

utf8.Valid sounds like a good idea for binary detection. That linked spec only does it on the first 1445 bytes and I generally think we should limit that parsing for performance reasons.

Well 0x1e is a control character - I think it's worth noting that well behaved documents should not have. I mean it's invalid in XML.

I mean it's invalid in XML.

Also every Go code is invalid XML and yet, Gitea is able to present Go code properly, right?
I'm not sure how we ended up that XML is the reference here. :wink:
_Other_ source code obviously _can_ contain different kinds of control characters.

Furthermore GitHub, which seem to be a reference for many things in Gitea, is handling it properly, too.

I absolutely agree that it's an edge case but I'm really curious how we got to XML conformity here. :thinking:

Have you considered that that the fact that it is invalid in w3c text formats like XML, HTML etc might be the reason why the content type is detected as binary...?

So the current check whether a file is a binary file is checking whether the file would be XML?
Again: this would exclude many code files from other languages, since Go or C code isn't a W3C format.

I mean, guessing file types is definitely difficult. A complicated content check could sometimes be worse than just mapping the file extension to a mime type. And even going for the magic bytes can lead to false results.

Or do you want to imply that the browser decides whether the View Raw view is shown instead the file contents? :thinking:

No I'm explaining that the reason why detectcontenttype is saying that the file is binary is because the character is not allowed in web text formats. (Even if browsers don't just barf on them they're supposed to be escaped in html etc.)

How do you propose we detect binary formats?

I ask in all honesty because the technique used in git itself is pretty horrible and fatally flawed - IIRC it's simply does the file contain a NUL (0x0) character within the first 1kb. This is why git handles UTF-16 formats badly.

The next problem we face is the detect content encoding problem. Because people persist in committing documents not in UTF-8 and in encodings like CP-1252 we need to detect these to show them. In general encoding detection libraries do not expect to see non CR or LF control characters and will assign a very low likelihood to any encoding.

Then once you've got those sorted you'll need to escape the control characters as they are absolutely not supposed to be in html documents - (I don't know whether highlighter is doing the correct thing above) - they should therefore be replaced.

Dealing with text encodings is not easy. It is subtle and full of problems. The utf8.Valid solution noted above would fail non-utf8 character encodings.

Thank you Zeri, your last comment indeed explains your thoughts pretty good. :+1:

Was this page helpful?
0 / 5 - 0 ratings