Vscode: Web: support other encodings than UTF-8

Created on 16 Aug 2019 · 40Comments · Source: microsoft/vscode

Ideas: https://github.com/microsoft/vscode/pull/79232

feature-request file-encoding insiders-released on-release-notes on-testplan web

Source

bpasero

👍1

Most helpful comment

I would like to also thank everyone envolved:

@LeuisKen for reporting the original issue and inspiring me to give it a shot with his own contributions
@bpasero for lying down initial plans and helping me to get this done
@ashtuchkin for being so responsive on iconv-lite side and giving amazing advice on bundle size reduction

Love the open-source community! ❤️

gyzerok on 5 Jul 2020

❤3

All 40 comments

Is there any plan for this issue ?

LeuisKen on 5 Sep 2019

👍1

Webpack seems possible for iconv-lite but not without modifications because it seems they do not want to bundle the stream support into the browser (related https://github.com/ashtuchkin/iconv-lite/issues/204). We would need:

to fork iconv-lite and make sufficient changes to include the streams support
export all streams that we actually use from node.js or rewrite our code to be node.js stream agnostic
possibly reduce the size of iconv-lite to only include the encodings we support (I measured ~1MB library size after webpack)

I used the following webpack.config.js:

const path = require('path');

module.exports = {
    entry: 'iconv-lite',
    devtool: 'inline-source-map',
    mode: 'production',
    output: {
        path: path.resolve(__dirname, 'lib'),
        filename: 'iconv-lite-umd.js',
        libraryTarget: 'umd',
        globalObject: 'typeof self !== \'undefined\' ? self : this'
    }
};

bpasero on 11 Apr 2020

Hey, iconv-lite author here. The issue you mentioned should be fixed now (needs some more checking in other environments). As an additional benefit, the Streaming API should work in the browser if 'stream' module is included (or if enabled explicitly). Let me know if you have any questions about enabling it.

I'm trying my best to keep the size to a minimum; in my measurements, with streams, iconv-lite takes ~320Kb minified by webpack (I just used webpack cli with default params). I guess 1Mb is the source map stuff. Would that be small enough for your purposes?

ashtuchkin on 18 May 2020

🎉3

@ashtuchkin thanks a lot for commenting here, I am planning to pick this up soon (coming months) and will report back to you how it goes 👍

bpasero on 18 May 2020

👍1

Awesome, looking forward to it!

ashtuchkin on 18 May 2020

Hello @bpasero!

I am in desperate need of this feature, so I was thinking maybe it is a good opportunity for me to contribute. However currently I have zero knowledge about vscode codebase, so I decided it'd be good to verify it with you before diving in.

Would you be open to this being an outside contribution? Do you think adding this feature is doable for the first time contributor? Would it be possible for you to provide me some guidance as which files/places in the project this relates to?

gyzerok on 25 May 2020

Thanks for the offer, but not having fully understood yet how to approach this from the VSCode side makes this a bit harder for me to ask for help. I think there are a few steps towards the solution that maybe can be decoupled and worked on individually. I think we need a way to consume iconv-lite in UMD form to make it work both in browser and desktop similar to what we did with https://github.com/microsoft/semver-umd. E.g. have a NPM module we can consume that is the webpacked version of iconv-lite.

One thing not clear to me yet is how to get node.js streams into VSCode web, because we heavily depend on node.js streams for all encoding related things. I am not sold in having to use node.js streams so maybe an alternate solution is to convince iconv-lite to have a very simple API that different stream solutions can be plugged into?

The code for encoding handling lives in nativeTextFileService which only exists in desktop for now.

bpasero on 25 May 2020

iconv-lite fully supports webpack (especially the 0.6 version I'm working on right now), so UMD package similar to semver-umd should work great. Before I release 0.6, please use master branch for now if you want to check it out.

As for the streams - iconv-lite internally uses a simplified stream interface for all codecs that is also compatible with Node's StringDecoder class (but doesn't depend on it). Basically the interface is the following:

interface Decoder {
  write(buf: Buffer): string;
  end(): string | undefined;
}

interface Encoder {
  write(str: string): Buffer;
  end(): Buffer | undefined;
}

// usage:
iconv.getDecoder(encoding): Decoder
iconv.getEncoder(encoding): Encoder

I'm wondering if that would help?

ashtuchkin on 25 May 2020

Awesome, thank you!

Looks like I have all the puzzle pieces. Will take a deeper look during the weekend. An opportunity to contribute to VSCode (which I have been happily using for 4 years) is very exciting :)

gyzerok on 26 May 2020

As for the streams - iconv-lite internally uses a simplified stream interface

@ashtuchkin does that mean I can pass in "something" that matches that interface and it will work? I would need this to be exposed from the TypeScript types of iconv-lite to work properly. I think today the type is still a node.js one:

https://github.com/ashtuchkin/iconv-lite/blob/de9c10bc9c424f49f451f34a9b09408f40b0aba7/lib/index.d.ts#L15-L17

If that was changed, maybe VSCode could use its own stream helpers and not the node.js ones.

bpasero on 26 May 2020

I can add these methods (getEncoder and getDecoder) to typescript definition, currently they are not there. I was treating them as non-public for the time being as they are not usually useful for Node.js. I did get requests to make them public before, so it probably makes sense to just do it.

The link you gave is for decodeStream/encodeStream - these are for Node.js streams, you're right. These are not what I was talking about.

does that mean I can pass in "something" that matches that interface and it will work?

It's the other way around: iconv-lite itself provides encoder/decoder objects with this interface and you can wrap it with your stream implementation/helpers to do the conversion.

ashtuchkin on 26 May 2020

As an example, I'm wrapping these Encoder/Decoder stream objects with Node.js Transformer class to create Node.js-compatible transformer streams here https://github.com/ashtuchkin/iconv-lite/blob/master/lib/streams.js#L27 (this.conv is the Encoder)

ashtuchkin on 26 May 2020

Added methods to .d.ts in the latest master commit (https://github.com/ashtuchkin/iconv-lite/commit/4c3eff9dac5c225f38500a633a0cfdc4ad6e0eb8).

ashtuchkin on 26 May 2020

👍2

Hello @bpasero!

I've taken a bit closer look into the source code and am thinking that I would really like to try to contribute this feature. However I totally understand that it is not something small and you might want to do it internally inside VSCode team.

So I would kindly ask you to give me a chance here and champion me if it is at all possible. From my side I will try to split the work into smaller chunks to make it easier for you to review it and give your feedback. If at any point in time you will think that this collaboration isn't working for any reason, I'll step down.

Here is my initial plan of how to complete the task:

Combine StringDecoder-like interface from iconv-lite with VSCode steam to decouple nativeTextFileService and encoding from node streams. This way I can better understand the whole process and test the whole story working on the code that should actually work already.
Create UMD build of iconv-lite.
Move encoding module from node to common.
Reuse my knowledge and possibly code from 1. to make multiple encodings work on the web the same way it works on node.
Possibly write some tests or migrate tests from node side (if any) to be used cross-platform.

How does this sound to you?

gyzerok on 28 May 2020

Yeah that sounds good to me. The first step of removing node.js streams for all encoding matters and using our own streams VSCode provides that are web supported is a good starting point.

Btw, we use another library jschardet to support "guessing" of encodings. You will probably see that during the process. I am not sure if that one can run in the web as well, I have not looked into it yet. But luckily the use of this library is limited to very few places so the adoption should be much easier compared to iconv-lite.

Timing wise we are entering endgame next week, I will probably not have a lot of time to look into this until the week after next week, but feel free to create a PR for the first step. I even think the second step can be done too. For that purpose though I would go and create a (Microsoft owned) GH repository to host the UMD version of iconv-lite. Does that make sense that MS publishes this module even though you contribute to it?

bpasero on 28 May 2020

Timing wise we are entering endgame next week, I will probably not have a lot of time to look into this until the week after next week, but feel free to create a PR for the first step. I even think the second step can be done too.

Good luck with upcoming release then. Seeing this small dot above the gear always makes me happy! Let's see how far I can get with the first step by the time you are done with Thanos :)

Btw, we use another library jschardet to support "guessing" of encodings. You will probably see that during the process. I am not sure if that one can run in the web as well, I have not looked into it yet.

Hopefully it won't be a big deal. Will take a closer look during the process.

For that purpose though I would go and create a (Microsoft owned) GH repository to host the UMD version of iconv-lite. Does that make sense that MS publishes this module even though you contribute to it?

Sure, it absolutely makes sense.

gyzerok on 28 May 2020

@ashtuchkin do you think you can make exposing getEncoder/getDecoder types a patch release for 0.5? This will allow the first part of my work to be independent from 0.6 release timeline.

gyzerok on 28 May 2020

Btw I just created https://github.com/microsoft/iconv-lite-umd where I suggest we can work through PRs.

bpasero on 28 May 2020

👍1

Here is the update for the steps:

I am done with the code, just need to recheck everything and properly open the PR.
Here is the PR https://github.com/microsoft/iconv-lite-umd/pull/1

If you'll happen to find time to review and publish iconv-lite-umd, I will include it's usage in my PR for the 1st step already.

gyzerok on 2 Jun 2020

Thanks, as I said we enter endgame this week so I will only have time to pick this up again the week after, but feel free to open up that PR.

bpasero on 2 Jun 2020

Thanks, as I said we enter endgame this week so I will only have time to pick this up again the week after

My apologies - had no intention to give you any pressure. Guess I was just happy about moving forward and wanted to share :)

The PR is now opened and linked to the issue.

gyzerok on 5 Jun 2020

Also a tiny jschardet update. It looks like they have a staightforward browser story: https://github.com/aadsm/jschardet#browser 🎉

gyzerok on 5 Jun 2020

In the meantime, since I've had more time this weekend, I've decided to reiterate on the plan for the next steps. No need to check it now, just leave it be until after we deal with the first steps.

The general idea is to make changes so that they can be merged and released separately. It will mean that we can roll them out gradually through insiders channel to see that nothing gets broken. It also means delaying enabling multi-encoding support in the browser until very last step.

[ ] Move encoding.ts to common.
- Update all the current places to use iconv-lite-umd instead of iconv-lite
- Add jschardet (we can use browser bundle which comes with the original package) and iconv-lite-umd to remote/web/package.json, workbench.html and workbench-dev.html
- Not sure how to be with encoding.test.ts - it uses fs, so it'll violate layers rules if moved to common, however it feels like it belongs there and fs is being used only for test purposes
[ ] Move EncodingOracle implementation from nativeTextFileService to textFileService
- Override it in browserTextFileService so it still enforces utf8
[ ] Move all read/write logic that now can be reused across platforms from nativeTextFileService to textFileService
- Reuse tests from nativeTextFileService for textFileService, we really don't need them twice for electron and browser now
[ ] Enable multiple encodings support for browser
- Remove EncodingOracle mock inside browserTextFileService so it inherits it
- Allow users to define files.encoding setting
- Add encoding button to the status bar
- Something else?

gyzerok on 7 Jun 2020

@gyzerok I have just released iconv-lite v0.6.0 with proper Electron support and the typescript updates. Let me know if you need anything else.

ashtuchkin on 8 Jun 2020

❤1

@ashtuchkin these are really great news! Thank you for release, appreciate it 😄

Actually you might be able to help me a bit more with reducing the UMD build size.

Full disclosure: I am not well versed in encodings, so most of my knowledge comes from wikipediaing for the purpose of completing this task 😅

In the https://github.com/microsoft/iconv-lite-umd/pull/1 there is a UMD build of iconv-lite. I have excluded utf7, utf32 and cp949. I've also tried excluding other things, but that breaks VSCode. The list of encodings supported by VSCode can be found here.

Do you think there is any other room in the PR for reducing the size?
Have you considered making the API so, that consumers can import encodings they need themselves and pass them into the library? Sort of similar to what you've done with the streams.

gyzerok on 9 Jun 2020

It looks like the inline source map takes more than half of the size (332kb out of 582kb). I'm wondering if we could put it in a separate file or get rid of it. This leaves us with 250kb of actual code.

I used webpack --display-modules to look inside and there's not much we could strip. Maybe a couple kb here and there (e.g. iconv-lite/lib/streams.js is probably not needed). If you already have Buffer class somewhere, we can exclude buffer module to save 50kb.

Have you considered making the API so, that consumers can import encodings they need themselves and pass them into the library? Sort of similar to what you've done with the streams.

It's an interesting idea, but will probably require a major refactoring. We haven't had a use case with such strict size requirements, so I didn't want to complicate things.

ashtuchkin on 9 Jun 2020

It looks like the inline source map takes more than half of the size (332kb out of 582kb). I'm wondering if we could put it in a separate file or get rid of it. This leaves us with 250kb of actual code.

That's a great discovery!

I was adjusting already existing config from semver-umd module and inline source maps are coming from there. Not sure if it was an oversight or done on purpose. Will ask in the PR, thank you for pointing it out!

gyzerok on 10 Jun 2020

@gyzerok @ashtuchkin the changes on master allow to consume iconv-lite without node.js stream dependency, which is great. However we still have a Buffer dependency. Can we change this to a Uint8Array dependency instead to be able to use this without having to shim Buffer?

bpasero on 17 Jun 2020

I guess this is more of a question to Alex, cause it is imported internally. Not sure if there is some way to override it from the outside, that I am missing.

gyzerok on 17 Jun 2020

Looks like https://github.com/ashtuchkin/iconv-lite/issues/235 is already opened for this matter.

I also just reported https://github.com/ashtuchkin/iconv-lite/issues/242. I feel it is a bit strange to have the stream based API talk about string. I would like to avoid having to allocate strings to reduce memory pressure. Not sure how it was before with the node.js streams, but imho everything should only go through Uint8Array

bpasero on 17 Jun 2020

@bpasero meanwhile I am not sure if https://github.com/ashtuchkin/iconv-lite/issues/235 blocks us from proceeding with implementation or I can go ahead even though Buffer shim is currently included. Can you clarify this for me?

Currently webpack uses https://github.com/feross/buffer which is around 50 KiB (27 KiB minified, around 7 KiB gzipped). It of course is redundant weight, however from my perspective it is acceptable temporarily.

gyzerok on 18 Jun 2020

@gyzerok yeah makes sense, how would VSCode access this buffer shim? I think iconv-lite-umd should possibly export it as from its index.d.ts?

I want to be careful to only really use it for iconv-lite and not spread it around anywhere else in code, so it can only be used in encoding.ts ideally.

bpasero on 18 Jun 2020

@gyzerok I think the best is to expose methods from the UMD Module that use Uint8Array and internally forward to iconv-lite as shimmed Buffer

bpasero on 18 Jun 2020

@bpasero

yeah makes sense, how would VSCode access this buffer shim? I think iconv-lite-umd should possibly export it as from its index.d.ts?

Not sure what you are asking here.

I want to be careful to only really use it for iconv-lite and not spread it around anywhere else in code, so it can only be used in encoding.ts ideally.

From types perspective we do not expose any details of the fact that any shim is used, because only VSBuffer goes outside of encoding.ts.

gyzerok on 18 Jun 2020

@bpasero both Node.js Buffer and shim buffer extend Uint8Array. And VSBuffer internally refers to the thing it wraps as Uint8Array.

So even if some code will use vsBuffer.buffer it will think of it as Uint8Array which is true.

gyzerok on 18 Jun 2020

+1. Also for decoding, I'm 99% sure you can pass Uint8Array instead of Buffer and it'll work fine.

ashtuchkin on 18 Jun 2020

@gyzerok yeah anything really works for me for as long as it stays isolated to encoding.ts, please go ahead as you see fit and I can give feedback from a PR 👍 . I already commented in iconv-lite repo that you can feel free to use VSBuffer helpers if it helps. VSBuffer is our way of supporting both Buffer if available or fallback to Uint8Array.

bpasero on 18 Jun 2020

@ashtuchkin while working further on the implementation I've discovered that there is one case, where Uint8Array is not accepted in browser environment. Here is the code which is responsible: https://github.com/ashtuchkin/iconv-lite/blob/v0.6.0/encodings/internal.js#L48-L59

Internally StringDecoder calls Buffer.toString(encoding) which doesn't work for Uint8Array. Simple test case:

const iconv = require('iconv-lite-umd');
iconv.decode(Uint8Array.from([...iconv.encode('Lorem ipsum', 'utf16le')]), 'utf16le');

The easy way to fix it would be to change InternalDecoder to be as follows:

function InternalDecoder(options, codec) {
    this.decoder = new StringDecoder(codec.enc);
}

InternalDecoder.prototype.write = function(buf) {
    if (Buffer.isBuffer(buf)) {
        return this.decoder.write(buf);
    }

    return this.decoder.write(Buffer.from(buf));
}

InternalDecoder.prototype.end = function() {
    return this.decoder.end();
}

Do you think this is an acceptable change for iconv-lite?

gyzerok on 22 Jun 2020

Thanks @gyzerok for driving this from the VSCode side and @ashtuchkin for the iconv-lite side 👍

bpasero on 4 Jul 2020

I would like to also thank everyone envolved:

@LeuisKen for reporting the original issue and inspiring me to give it a shot with his own contributions
@bpasero for lying down initial plans and helping me to get this done
@ashtuchkin for being so responsive on iconv-lite side and giving amazing advice on bundle size reduction

Love the open-source community! ❤️

gyzerok on 5 Jul 2020

❤3

Was this page helpful?

0 / 5 - 0 ratings