Linguist: Invalid byte sequence error when reading files with malformed UTF-8 sequences

Created on 13 Jan 2013 · 14 Comments · Source: github/linguist

For example, running linguist on this file throws an invalid byte sequence error:

$ wget https://raw.github.com/leoniedu/CongressoAberto/a4785785cb37e8095893dc411f0a030a57fd30f8/CongressoAbertoWP/wp-includes/js/swfupload/swfupload.js
$ linguist swfupload.js
/Users/orii/.rvm/gems/ruby-1.9.3-p286/gems/github-linguist-2.4.0/lib/linguist/blob_helper.rb:209:in `split': invalid byte sequence in UTF-8 (ArgumentError)
        from /Users/orii/.rvm/gems/ruby-1.9.3-p286/gems/github-linguist-2.4.0/lib/linguist/blob_helper.rb:209:in `lines'
        from /Users/orii/.rvm/gems/ruby-1.9.3-p286/gems/github-linguist-2.4.0/lib/linguist/blob_helper.rb:240:in `loc'
        from /Users/orii/.rvm/gems/ruby-1.9.3-p286/gems/github-linguist-2.4.0/bin/linguist:24:in `<top (required)>'
        from /Users/orii/.rvm/gems/ruby-1.9.3-p286/bin/linguist:23:in `load'
        from /Users/orii/.rvm/gems/ruby-1.9.3-p286/bin/linguist:23:in `<main>'
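The failure can likely be reproduced without Linguist at all. The sketch below assumes the file contains at least one byte that is not valid UTF-8; `SAMPLE` and `split_error_message` are hypothetical names for illustration, and the regexp is only a stand-in for whatever `lines` does at the line shown in the trace.

```ruby
# Hypothetical stand-in for the file contents: a lone 0xA9 byte
# (the Latin-1 copyright sign) is not a valid UTF-8 sequence on its own.
SAMPLE = "var a; // \xA9 2013\nvar b;\n".dup.force_encoding("UTF-8")

# Splits on line endings with a regexp and returns the error message
# instead of raising, or nil when the split succeeds.
def split_error_message(data)
  data.split(/\r\n|\r|\n/)
  nil
rescue ArgumentError => e
  e.message
end

puts SAMPLE.valid_encoding?        # the string is tagged UTF-8 but is not valid
puts split_error_message(SAMPLE)   # message of the ArgumentError, if one was raised
```

Regexp matching in Ruby checks that the string is valid in its tagged encoding, which is why the error surfaces in `split` rather than when the file is read.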
Bug

All 14 comments

I can confirm this error.

Is this still a bug?

Same thing for me...

/Users/axw2/.rvm/gems/ruby-2.0.0-p247/gems/github-linguist-2.9.5/lib/linguist/generated.rb:41:in `split': invalid byte sequence in UTF-8 (ArgumentError)
        from /Users/axw2/.rvm/gems/ruby-2.0.0-p247/gems/github-linguist-2.9.5/lib/linguist/generated.rb:41:in `lines'
        from /Users/axw2/.rvm/gems/ruby-2.0.0-p247/gems/github-linguist-2.9.5/lib/linguist/generated.rb:100:in `compiled_coffeescript?'
        from /Users/axw2/.rvm/gems/ruby-2.0.0-p247/gems/github-linguist-2.9.5/lib/linguist/generated.rb:56:in `generated?'
        from /Users/axw2/.rvm/gems/ruby-2.0.0-p247/gems/github-linguist-2.9.5/lib/linguist/generated.rb:12:in `generated?'
        from /Users/axw2/.rvm/gems/ruby-2.0.0-p247/gems/github-linguist-2.9.5/lib/linguist/blob_helper.rb:277:in `generated?'
        from /Users/axw2/.rvm/gems/ruby-2.0.0-p247/gems/github-linguist-2.9.5/lib/linguist/repository.rb:74:in `block in compute_stats'
        from /Users/axw2/.rvm/gems/ruby-2.0.0-p247/gems/github-linguist-2.9.5/lib/linguist/repository.rb:69:in `each'
        from /Users/axw2/.rvm/gems/ruby-2.0.0-p247/gems/github-linguist-2.9.5/lib/linguist/repository.rb:69:in `compute_stats'
        from /Users/axw2/.rvm/gems/ruby-2.0.0-p247/gems/github-linguist-2.9.5/lib/linguist/repository.rb:43:in `languages'
        from /Users/axw2/.rvm/gems/ruby-2.0.0-p247/gems/github-linguist-2.9.5/bin/linguist:14:in `<top (required)>'
        from /Users/axw2/.rvm/gems/ruby-2.0.0-p247/bin/linguist:23:in `load'
        from /Users/axw2/.rvm/gems/ruby-2.0.0-p247/bin/linguist:23:in `<main>'

I can confirm this also...

I received this error as well after manually running linguist on my repo.

I can still reproduce this with ruby 1.9.3p194 (2012-04-20 revision 35410) [x86_64-linux] on Debian wheezy.

This seems to be the same as https://github.com/github/linguist/issues/241

I had the same error several times (last time on pdt-git/public).

I tried using force_encoding() and encode() on line 58 without effect.
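One possible reason those calls had no effect: `force_encoding` only relabels the bytes, and `encode` to the encoding the string already has does not transcode anything, so the invalid bytes survive. A minimal sketch of an alternative, assuming Ruby 2.1 or later where `String#scrub` is available (`BAD` and `safe_lines` are hypothetical names, and this is not necessarily what Linguist itself ended up doing):

```ruby
# Hypothetical content with one invalid byte (0xA9) in otherwise ASCII text.
BAD = "a\xA9b\nc".dup.force_encoding("UTF-8")

# Encoding to the same encoding leaves the bytes untouched,
# so the result is just as invalid as the input.
same = BAD.encode("UTF-8")
puts same.valid_encoding?   # still false

# String#scrub replaces each invalid byte sequence with the given
# replacement, after which regexp-based splitting is safe.
def safe_lines(data)
  data.scrub("?").split(/\r\n|\r|\n/)
end

p safe_lines(BAD)
```

Scrubbing is lossy: the replaced bytes are gone, which is usually acceptable for counting lines but not for round-tripping the file.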

Confirmed this is still an issue on 1.9.3-p484.

I'm going to close this. Ruby 2.0 is nearly two years old now and I just don't see us investigating this any time soon, sorry.

If anyone else wants to take a stab at this then please be my guest :smile:

@arfon I get this error on the original file of this post with Ruby 2.2:

$ ruby --version
ruby 2.2.0p0 (2014-12-25 revision 49005) [x86_64-linux]

Well then.

I have also encountered this many times using ruby 2.2. Here is a recent stack trace:

$ linguist swfupload.js 
/var/lib/gems/2.2.0/gems/github-linguist-4.2.7/lib/linguist/blob_helper.rb:266:in `split': invalid byte sequence in UTF-8 (ArgumentError)
    from /var/lib/gems/2.2.0/gems/github-linguist-4.2.7/lib/linguist/blob_helper.rb:266:in `lines'
    from /var/lib/gems/2.2.0/gems/github-linguist-4.2.7/lib/linguist/blob_helper.rb:283:in `loc'
    from /var/lib/gems/2.2.0/gems/github-linguist-4.2.7/bin/linguist:51:in `<top (required)>'
    from /usr/local/bin/linguist:23:in `load'
    from /usr/local/bin/linguist:23:in `<main>'

I can still confirm this error.
When I run the following command, the program crashes:

github-linguist Inderxer.asp.txt

BTW, the encoding of Inderxer.asp.txt is GB2312.
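A GB2312 file is a slightly different case from a corrupted one: the bytes are well formed, just not in UTF-8. When the source encoding is known, transcoding is an option instead of discarding bytes. The sketch below is an assumption-laden illustration: it uses made-up file contents, a hypothetical helper `gb_to_utf8`, and Ruby's `GB18030` encoding as a superset of GB2312.

```ruby
# Hypothetical file contents: the GB2312 bytes for U+4E2D U+6587,
# followed by a newline, held as a binary string.
RAW = "\xD6\xD0\xCE\xC4\n".b

# Tags the bytes as GB18030 (assumed here as a superset of GB2312)
# and transcodes them to UTF-8. Raises if the bytes are not valid GB18030.
def gb_to_utf8(bytes)
  bytes.dup.force_encoding("GB18030").encode("UTF-8")
end

puts RAW.dup.force_encoding("UTF-8").valid_encoding?   # read as UTF-8, the bytes are invalid

converted = gb_to_utf8(RAW)
puts converted.valid_encoding?
p converted.split(/\r\n|\r|\n/)
```

In practice the encoding of an arbitrary file has to be guessed first, so a tool processing unknown repositories would need some detection step before a conversion like this.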

This has been resolved by https://github.com/github/linguist/pull/4730 which is now live on GitHub.com. Closing.
