Node: Building with full-icu by default

Created on 7 Mar 2018  Ā·  113Comments  Ā·  Source: nodejs/node

Currently, users cannot rely on full i18n support to be present cross-platform and even cross-distribution mainly because different package maintainers use different configurations for ICU and if Node.js was built with system-icu one still has to have libicu installed. Browsers on the other hand generally do support full i18n out of the box.

There is the option to use the full-icu package but it is somewhat awkward to use as it requires a environment variable or commandline switch to work.

Building with full-icu is currently a ~40% increase binary size (on macOS, it goes from 35M to 49M). Is this an acceptable tradeoff? I'm thinking that if we build with it, ICU data should be moved in-tree so the build does not rely on external downloads.

cc: @nodejs/intl

feature request i18n-api

Most helpful comment

+1 to this, 35M to 49M is totally worth full i18n

All 113 comments

The binary size has always been the key issue with ICU, particularly on resource constrained devices where Node.js is expected to run. As long as we still have the ability to produce small-icu builds, my personal preference would be to build with full-icu by default.

+1 to this, 35M to 49M is totally worth full i18n

-1

@mscdex can you elaborate? i assume you're -1 because of binary size?

One option is to get the full-icu package working without options. I need
help getting the platform part to work.

@devsnek Yes.

I'm also -1 on this for the same reason as @mscdex. I'd rather find a way to fix #3460 instead.

I was strongly against full-icu when ICU was introduced but I've since made a 180 flip. Ease of use and being able to rely on it just being there outweigh the downside of a bigger binary.

The binary size doesn't particularly trouble me anymore. If that was really an issue we would build without debug symbols or make them available as a separate download, but we don't.

Bigger runtime memory footprint is something that should be checked. ICU is pretty parsimonious though, what I know of it.

As long as we still have the ability to produce small-icu builds, my personal preference would be to build with full-icu by default.

Yes, building the smaller variants should still be an option. Thought, I don't see platforms where a few MBs more are an issue as the primary consumers for Node.js.

One option is to get the full-icu package working without options.

That would be an improvement, but it's still an issue that users first have to discover this package, probably after wondering why i18n APIs don't work like they do in browsers.

As a package author, I'm reluctant to include full-icu as dependency because of application size concerns (which must be carried by every application as opposed to once if it were built into Node.js).

@silverwind

users first have to discover this package

it could be a checkbox in the installer to installā€¦Ā or a suggested package within ubuntu, etc

which must be carried by every application

couldn't it be deduped via npm?

I was strongly against full-icu when ICU was introduced but I've since made a 180 flip. Ease of use and being able to rely on it just being there outweigh the downside of a bigger binary.

I wasn't trying to do a bait and switch :) full by default I like for many reasons, simplicity, cultural-linguistic correctness, dropping the words 'English only' from the vocabulary , etc.. But I do understand that the size issues are real. Not insurmountable but real.

the way i see this, node has been working with negative size constraints because it was always missing full-icu. i build with full-icu everywhere anyway and like @silverwind said the primary consumers of node are not the people who can't download 40mb binaries. that being said, we can still keep our small-icu builds and just add full-icu builds

ICU should be moved in-tree so the build does not rely on external downloads.

  • ICU is already in-tree, but is specially trimmed to just have small data. Part of the size issue was repo size. So, one step could be include full data in the repo (Unless there's some other clever idea - subrepo? git LFS? lossy compression?)
  • Then, make 'full' the default out of the repo when you do configure && make (Ideally, packagers and other users should be specifying --with-intl ( none, full-icu, small-icu ) already and so this only affects the default.

One option as far as download could be to have two sets of downloads (small and full). That means complexity for the build and website project, though.

Some statistics:
full-icu gets 47k downloads a month, the underlying data package icu4c-data 41k a month. In Github, 112 repos and 45 packages depend on full-icu.

ICU is already in-tree, but is specially trimmed to just have small data. Part of the size issue was repo size. So, one step could be include full data in the repo

Yes, that's what I meant, the actual dataset. I don't think we'll need LFS or other nonstandard git extensions. Just need make sure that there are no unneccesary files in the tree.

Two use cases for smaller binaries: docker containers, and serverless bundles (some ppl package the node binary in their Lambda so they don't need to rely on AWS' outdated versions ā€“Ā even more would if the size was smaller, and bundle size is a big factor in startup speed)

As the maintainer of https://github.com/mhart/alpine-node I'm going to be in a bit of bind with this one ā€“Ā I could just force the builds to be small-icu, but then there'll be a slew of issues from users who haven't expected it to diverge from the other Linux builds. I could also offer the option for a small and a full build, but then I'm doubling the work that needs to be done each release.

I'm one of the ppl that has despaired as the binary size has crept up from 18MB over the years šŸ˜ž

@mhart:

I could also offer the option for a small and a full build, but then I'm doubling the work that needs to be done each release.

This is hopefully less than double the work, but you could force the builds to be small-icu, and then have another layer(/tag) that just adds the full datafile (via npm i [-g] full-icu perhaps) and sets NODE_ICU_DATA before launching node. Then it's at least not 2x the full container size to maintain.

@srl295 ooh, that's an interesting option actually. So will the full-icu builds be basically 100% compatible with someone installing the full-icu npm module? In that case... I'm not sure why we even include it in node core...? šŸ¤”

i'm still unsure why we have to stop providing small-icu builds if we add full-icu builds. in my mind we can provide both for every release? all this talk about binary size could even get us on track for streamlining "normal" and "small" node releases with several factors such a debug symbols and icu data etc.

@mhart the full-icu npm module just helps you get the right binary data file. Once that's there, node starts up and uses that data, rather than the embedded English-only content. A --with-intl=full-icu _build_ on the other hand builds ICU normally, and doesn't strip out the non-English content.

@devsnek

in my mind we can provide both for every release?

that's an option I mentioned in https://github.com/nodejs/node/issues/19214#issuecomment-372331186 ā€”Ā it could certainly be done that way.

@srl295 what I was getting was more: if this is available as a module, then I think it's worth thinking long and hard about adding it to core in the first place ā€“Ā considering that 95-100% of it won't be used by most node applications.

@silverwind are there some compelling arguments besides "browsers have this built-in"? Are there an unusual number of issues that arise from the current level of icu support?

@mhart There isn't a 'real' module. It's not modularizable. The full-icu npm module _by side effect of installing it_ causes the data file to be put somewhere, and the end user still has to tell node _before startup_ to go look for it. https://github.com/nodejs/node/issues/3460 mitigates that a little bit, by making node look in a well known place (help needed).

considering that 95-100% of it won't be used by most node applications.

Maybe, but more and more functionality is going through ICU. But as I said above and elsewhere, the pain point of package size is also very real.

There's also the less tangible questions such as, how many developers' experience is impacted by lack of support out of the box for their language, such as https://github.com/nodejs/node/issues/13907

Expected result (from Chrome 58.0.3029.110): '25 ŠøюŠ½Ń 2017 Š³.'
Actual result:

v8.1.2: '2017 M06 25'
v7.9.0: 'June 25, 2017

What if the user just walked away from Node instead of filing a report? some other quotes from the above and linked issues:

Edit: nevermind, from searching around I discovered that the ICU auto-detection thing is sorta bunk and that I need to explicitly point Node at the ICU package.

Thanks for info. But there's a problem with full-icu package: unicode-org/full-icu-npm#6 (comment)
Node v8 not auto detects it..

@srl295 guess i'm very late to the party here, and this is probably off-topic, but is there a way for it to be loaded lazily? ie, post startup? Not looking for a solution for the standard distribution of the Node.js binary with this question, but more for my own knowledge to know if users could potentially install a stripped down version of node with no icu (or system icu) and then require a full icu later if they wanted.

Define 'late' :) (old context: https://github.com/nodejs/node-v0.x-archive/issues/7676 and https://github.com/nodejs/node-v0.x-archive/issues/6371 )

is there a way for it to be loaded lazily? ie, post startup?

no.

that's the difficulty in https://github.com/nodejs/node/issues/3460 - it can't be implemented as a normal module, or using javascript to do the discovery. everything must be setup before ICU is called first. The only alternative is to shut down everything that is using ICU (including all Intl objects) and unload/reload ICU. All intl objects become unusable. That seems untenable.

@srl295 Is there anything internal that calls ICU in the normal Node.js startup though? So long as it's the first thing a user does in their Node.js script, it should be fine... right?

I see a couple of different issues being discussed

1) What should be the default? Should full-icu be the default because otherwise, we may lose people because the initial experience is not as good or is it ok to have people do an extra step to download/opt into full icu.

2) Is download size an issue. If download size is not a problem, but the installation size is, then a larger download binary followed by stripping out of the ICU data file might be an option. If download size does matter, then the only way to make full ICU the default but retain the option for smaller downloads is 2 downloads. That would be more work although we might be able to build/test once and simply package twice by creating one package with the ICU data file and one without.

I don't have enough data to argue for one side or the other on either of the 2 questions yet.

Do we think trying to survey Node.js users through an online survey would provide good enough data to help make the decision? (FYI @dshaw, in case we want help from the user-feedback group to do a survey). I so aww @jasnell has a twitter poll to get some data but something broader might also make sense,

@mhart v8 links against and loads ICU. I don't think there's any such guarantee.

@silverwind are there some compelling arguments besides "browsers have this built-in"? Are there an unusual number of issues that arise from the current level of icu support?

Not really, I guess. In my case (linked at the top of this issue), I was just noticing inconsistent locale support when formatting numbers/dates across platforms and thought that should be fixed.

I guess number and date formatting would be the most desireable feature that requires full-icu.

Observation: apt-get install nodejs has better i18n support than our own builds, courtesy of linking against the system ICU.

Hmm, maybe we could consider autodetecting if the host system has icu installed, and use the system icu if so?

That makes the binary non-distributable to other systems. We don't link to the system openssl for the same reason.

We know a full-icu binary is about 40% bigger and I've analysed the size impact on the source.


deps/icu-small is currently about 22M. 2.6M of that is the data file, 19.4M is source code.


ICU 60.2 can be trimmed down to 18.5M source code because source/tools is not needed any more.

Its data file is 26M. Git compresses it to about 11M. xz a.k.a lzma compresses it down to 6.2M.

We could check in the .xz and decompress at build time but I don't know: that only affects the size of a git checkout, the source tarballs are already LZMA-compressed. Probably not worth the complexity.


Impact on compile times: negligible on platforms that support the .incbin assembler directive, we can just include the data file verbatim. Probably even a little faster than it is now because there's no post-processing.

On Windows we have to do a bin2c conversion that produces a ~60M source file and takes 15-30 seconds to compile.

As an optimization we could write out a .obj directly instead of going through an intermediate .c file first. I haven't tried that yet.


To summarize:

  1. Source tarballs would be about 5.5M or about 30% bigger than they are now.

  2. A new git checkout would be about 11M bigger but it's already on the order of ~450M.

@bnoordhuis might be good to coordinate. I have some comments on your approach above anyway.

As an optimization we could write out a .obj directly instead of going through an intermediate .c file first. I haven't tried that yet.

I guess my comment on this got lost. We should already be doing that, where do you see this not being true?

OK, i looked up bin2c and see it's another tool. Actually, ICU already does its own direct conversion to .obj files on windows and should on other platforms also.

@bnoordhuis I implemented using dependencies.txt to fetch the list of ICU sources: https://github.com/nodejs/node/compare/master...srl295:icu-use-depslist

Any update for nodejs v10?

I need nodejs with full-icu, anyway.

Finally, I have started to use https://github.com/unicode-org/full-icu-npm in my project. It just works.

ex.

NODE_ICU_DATA=`$(npm bin)/node-full-icu-path` node ...

Any update about this or @srl295 proposal with the auto detection of the full-icu (see here)?
It would be great to have the full icu built-in and don't have to worry about it.
@bnoordhuis made a good description of the situation.

i think the next step here would be someone opening a pr and getting more immediate feedback from that. @srl295 or @silverwind are y'all interested in doing that? if not i'll label this as help-wanted

A pr for full icu sounds good. I'd keep it simple and not reimplement icu
data packaging.

El El miƩ, may. 23, 2018 a las 7:43 PM, Gus Caplan notifications@github.com
escribiĆ³:

i think the next step here would be someone opening a pr and getting more
immediate feedback from that. @srl295 https://github.com/srl295 or
@silverwind https://github.com/silverwind are y'all interested in doing
that? if not i'll label this as help-wanted

ā€”
You are receiving this because you were mentioned.

Reply to this email directly, view it on GitHub
https://github.com/nodejs/node/issues/19214#issuecomment-391569648, or mute
the thread
https://github.com/notifications/unsubscribe-auth/AA0Ms2UPpmW-ghMjebNbnJOnpRzc0Om7ks5t1h5fgaJpZM4ShQ05
.

I'd prefer if @srl295 does it, but if he's busy, I guess we'd be open to contributions.

I have a work-in-progress in https://github.com/bnoordhuis/io.js/tree/fix19214 but I don't expect I'll get back to it this month.

yes.. @bnoordhuis it has some interesting things, but reimplements a lot of machinery and won't work everywhere.

I am interested- would simplify a lot of things. I'll try a more minimal delta and get some comments.

Hi everyone! Are there some updates regarding this?
It would be really awesome to have full-icu built-in!

would putting icudt*.dat into LFS make any sense? seems like a good use case, but of course it adds complexity (ICU is deploying LFS for its source)

using lfs sounds fine to me. would we be able to overwrite the pointer file with the configure step if needed? (system doesn't have lfs installed)

@devsnek one could always pass a URL to download URL and just redownload ICU if lfs in't installed. Or I guess configure could detect the situation - icudt##l.dat is preposterously small (<10k) - is LFS installed?

Open issues: (check off as they are decided)

  • [ ] do we want to do this at all?

    • pro: full international support by default.

    • con: increased binary size, increased repo size

  • [ ] Should Git-LFS be used for the .dat file?

    • pro: makes 25M binary blob more manageable. less bloat to the repo.

    • con: ci machines/collaborators need to have git-lfs installed

  • [ ] should the mechanism to build small ICU data be kept (@jasnell ) or removed?

    • pro: it greatly simplifies node to just consider this an ICU configuration matter, and remove the tooling {for building} and runtime code {for adding in full-icu at runtime}

    • con: users may like how node builds small-icu, and not want to have to learn a different tool as part of the ICU build. ICU currently does not have a one button replacement for exactly what node's small-icu does.

+1 for making full-icu by default. Also when you install Node via Brew, it already has full-icu.

full-icu doesn't have to be the default, but could Node provide distributions of full-icu builds? i.e.

  • For every binary release, there should be a -full-icu binary too
  • The Windows and mac installers should have a checkbox to include full ICU or not

This allows version managers like brew or nvm to provide a switch --with-full-icu that then downloads the full-icu binary / provides it to the MSI

@felixfbecker that would be an option that would have no code change here.

I vote for full-icu to be the default. It seems to be how every other JS environment operates, and not having it just causes issues with Node. Tests run everywhere, but then fail on the CI service because of this.

+1 for full-icu by default.

+1 for full-icu by default.

This topic came up during the i18n session at the OpenJS summit, and as a follow-up I put together this alternative to packaging in the full icu data to the initial binary, while still retaining access to all locales:

https://github.com/eemeli/proposal-intl-loadlocales

So the idea here is to make it possible to lazy-load locale data into the Intl object. The actual loading process and data structure is completely hidden from the application-level API, allowing it to be configured and cached separately.

Given the way icu initializes data, it's not clear if this is possible. What information will be lazy loaded and what is the source?

Is there any major disagreement to doing this, and is anything blocking it?

Also: can anyone tell me where I can find a list of the subset of locales that are included in Node by default?

@srl295 can you comment on the default locales?

The default subset is: English

What's blocking this consensus to increase the repo size when there's
already workarounds.

The default subset is: English

It seems like it's just U.S. English (en-US). Taking this example:

const currencyFormatter = new Intl.NumberFormat('en-CA', {
  style: 'currency',
  currency: 'CAD',
  minimumFractionDigits: 0,
  maximumFractionDigits: 2,
});
console.log(currencyFormatter.format(0.02));

In Chrome (while in Canada) I get the expected result:

$0.02

But with Node I get:

CA$0.02

It would be a lot nicer for everyone outside of the U.S. if Node included full-icu by default.

@nwoltman You are correct. It is en which is effectively en-US.

It would be a lot nicer for everyone outside of the U.S. if Node included full-icu by default.

Hey, it could be nicer for people inside the US also!

So there are a couple of issues, depending on what you mean by 'default'. (some of these are discussed above, and in the original tickets going back to 2013)

  1. download size. What's available to download? node and packagers could provide multiple versions available for download. I.e. there could be a 'node-with-full-icu' download button on nodejs.org, click that to get a binary built with --with-intl=full-icu --download=all.
  2. repo size. What's available by default when you run configure? What needs something downloaded? Including the full data means the git repo is larger, and grows by a larger amount every year.. and, possibly means turning git-lfs
  3. user convenience. The installer (or other packagers) could add a checkbox to download the full icu dataset when you download node.

These 3 are somewhat independent. and there are downsides:

  • adding a full icu download is less nice for the people who run the download site. Making there be _only_ a full icu download is less nice if people are bandwidth constrained.
  • making a larger git repo is less nice for devs. Not wanting to pick on git-lfs, but it does add complexity.
  • adding complexity to the installer could be less nice, especially towards download speeds.
  • adding a full icu download is less nice for the people who run the download site.

Counterpoint: Not having a full icu download is less nice for people who need it.

Making there be only a full icu download is less nice if people are bandwidth constrained.

Is there any data to verify this point? Who would be "bandwidth-constrained" enough that they couldn't download a few more MBs?

  • making a larger git repo is less nice for devs.

This seems to be one of the biggest sticking points on this issue. It's definitely a tradeoff. Can anyone estimate how much "less nice" the repo will be for devs if the full icu data were added to it (both with and without git-lfs)?

Also, would making things less nice for Node devs perhaps be worth it to make things more nice for app devs?

  • adding complexity to the installer could be less nice, especially towards download speeds.

Again, how much "less nice" would this actually be? Is it possible to estimate how much slower downloads will be? For me, adding 15 MB slows things down by 5 seconds, which definitely isn't a big deal (especially when compared to running npm install).


Overall, my preference would be for the default Node.js binary to come with full icu (which would be great since that would make Node more consistent with browsers) and I have no opinion on installers/packagers providing a small-icu version (since I would never use it).

Is there any data to verify this point?

I did an analysis last year in https://github.com/nodejs/node/issues/19214#issuecomment-374216527. You can probably add a few percent because ICU gets bigger, not smaller, as time goes on.

Thanks @bnoordhuis, that data is awesome!

Based on that information, it looks like we can conclude the following:

Developer Experience

Repo Size

A new git checkout would be about 11M bigger but it's already on the order of ~450M.

So it sounds like git-lfs would not be needed, and the increase to the repo size wouldn't have a noticeable impact.

Compile Times

negligible on platforms that support the .incbin assembler directive... Probably even a little faster than it is now because there's no post-processing.
On Windows we have to do a bin2c conversion that produces a ~60M source file and takes 15-30 seconds to compile.

So compile times would not change or might improve for people on Unix systems. The slowdown on Windows would be unfortunate, but my guess is that most people would be compiling Node on a Unix-like system.

Downloader Experience

Source tarballs would be about 5.5M or about 30% bigger than they are now.

5.5M isn't very much. This probably wouldn't have an impact on most downloaders.


Based on that data, it sounds like the impact of switching to full-icu by default would be mostly negligible (with the only noticeable downside being slower compile times on Windows).

Nathan how do you usually install node?

Bandwidth constraints are one, also small devices.

The bandwidth bill for https://nodejs.org/ might be another issue.

The bandwidth bill for https://nodejs.org/ might be another issue.

use cloudflare

Nathan how do you usually install node?

Either downloading from the website or with the apt package manager (or sometimes relying on the Node.js Docker image). I understand that the increase in bandwidth won't affect me personally, so I am probably unaware of how it would affect other people as well as the services that need to distribute the binary.

The main question is, who exactly would be affected and by how much? With that answered, the negatives of switching to full-icu by default could be compared to the positives, and then things could move forward (or at least this issue could be closed).

@zuohuadong That's an interesting point. https://nodejs.org/ actually does use Cloudflare. Although, I'm not seeing a Cache-Control header when downloading the binaries, so I'm not sure if every binary download request is actually going to the nodejs.org servers.

I don't think we use CloudFlare for the downloads because we've not had time to do the work that it would take to continue to capture the download metrics. @rvagg can correct me if I'm wrong.

@nodejs/tsc I'm going to add to the agenda so we can get some feedback on what people think about potentially increasing the download size. If we get enough feedback on the issue before the next meeting we can remove.

Don't use cloudflare for /download/, but if this goes through it would be good incentive to get around to sorting out the blockers for that.

Hi there, just wanna share my POV (primarily working experiences) about the shortcomings of distributed Node w/o full-icu:

  • full-icu has to dynamically download icu4c-data based on OS and a lot of corporate CIs don't allow Internet access & have everything vendored either via a mirror or checked in. This creates an issue in build pipeline for products using Node supporting multiple OSes & versions.
  • Building Node from source w/ ICU also isn't trivial for bazel (Google & DBX happen to use it). This also becomes very hard to scale as devs have to do so for their local dev envs as well.
  • Differences in Node CLDR & shipped v8 CLDR versions: This creates subtle bugs since they can ship w/ different versions. Having node CLDR align w/ v8 in terms of intl would be great.

I do want to echo the concern mentioned earlier with increasing the binary size by 40%. as others have stated this may have negative effects on cold start times for containers which would effect serverless environments.

I've reached out to some of the engineers internally at google to get their take on how drastic a change like this could be.

Thanks for the context! I'd love to push for distributing a full-icu version in addition to the default. I understand that this creates extra work in the distribution phase but would be curious to learn what the LOE on that would be.

I do want to echo the concern mentioned earlier with increasing the binary size by 40%. as others have stated this may have negative effects on cold start times for containers which would effect serverless environments.

I don't expect that to be an issue. It's static read-only data that gets paged in on demand. You pay for what you use.

@bnoordhuis the design is that you only pay for what you use, and that it is demand paged. In a multitenant environment, it's probably better to share a larger node binary among multiple tenants, rather than have each tenant have their own downloaded copy of the full-icu data which I would guess won't be shareable even if they are identical copies.

I don't know how this translates to serverless, but i would imagine the larger node binary ought to be shareable. The data in question is marked as RO in the segment (pure text) for this reason.

We did discuss this in the TSC meeting today. From the minutes:

  • Building with full-icu by default #19214

    • Some concerns expressed last week.
    • binary size
    • code size

    • Michael to take action to send email on TSC email list to ask if there are objections to
      suggestion a PR be opened to make full-icu the default and to complete the conversation
      there.

One thing that was mentioned was having good numbers for the impact on the the following will be needed:

  • size of download
  • size unzipped
  • code size in github

@mhdawson Is https://github.com/nodejs/node/issues/19214#issuecomment-374216527 sufficient? It's from last year but the numbers won't have changed dramatically in that time.

@bnoordhuis thanks, From https://github.com/nodejs/node/issues/19214#issuecomment-374216527 the bottom line was:

  • Source tarballs would be about 5.5M or about 30% bigger than they are now.
  • A new git checkout would be about 11M bigger but it's already on the order of ~450M.

The remaining numbers that I think would be good to have in the summary is the additional size on disk once extracted and the in memory impact if you don't use any additional languages which from some of the discussion will not be the same as the additional size on disk.

Numbers I got today (Linux x64):

| ICU | node binary size | loaded memory | execution time |
|---|---|---|---|
| small | 43MB | ~28MB | ~0.02s |
| full | 66MB | ~28MB | ~0.02s |

To measure loaded memory, I ran ./node -p 'process.memoryUsage().rss / 1024 / 1024'
To measure execution time, I ran time ./node -e '1+1'

@targos thanks. I guess the other number is the size of the release tarball (which I assume is smaller than the node binary size since its compressed) as what @bnoordhuis provided was the source tarball sizes.

does full icu need to be in the repo? could it be the default for download but not build from source?

Given those numbers, I'm in favor of just bundling full-icu by default.

FWIW ICU releases are generally on a 9 month cycle. v8 is pretty aggressive about picking up new versions due to new ecma402/ecma262 requirements.

sgtm for full icu.

Il giorno gio 15 ago 2019 alle 22:55 Steven R. Loomis <
[email protected]> ha scritto:

FWIW ICU releases are generally on a 9 month cycle. v8 is pretty
aggressive about picking up new versions due to new ecma402/ecma262
requirements.

ā€”
You are receiving this because you are on a team that was mentioned.
Reply to this email directly, view it on GitHub
https://github.com/nodejs/node/issues/19214?email_source=notifications&email_token=AAAMXY6BJKBZQ4PU4YTSPMLQEW7D7A5CNFSM4EUFBU42YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD4M67GI#issuecomment-521793433,
or mute the thread
https://github.com/notifications/unsubscribe-auth/AAAMXYYDDTDIDGWD5VB6T6TQEW7D7ANCNFSM4EUFBU4Q
.

If full ICU is in the node repo, would the plan be to move to git-lfs? or just check in the 26M file as a binary blob icudt*.dat without lfs? ( ~6M with xz, ~10M with gzip, ~9.5M with bz2 )

I strongly prefer that we simply check in the (compressed) blob. git-lfs == extra friction.

We can use a python script to decompress it if we don't want a dependency on gunzip(1) or bunzip2(1).

Unfortunately no. lzma is python3 only.

Does building with full-icu impact heapsnapshot sizes, or memory overhead while taking one?

It shouldn't. ICU data is memory mapped and not allocated. It may bump rss
up but should have no impact on heap or generation of heapsnapshots

On Fri, Aug 16, 2019, 20:06 Jeremiah Senkpiel notifications@github.com
wrote:

Does building with full-icu impact heapsnapshot sizes, or memory overhead
while taking one?

ā€”
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
https://github.com/nodejs/node/issues/19214?email_source=notifications&email_token=AADLM6PT74AOIZIO4KUYOT3QE5TMTA5CNFSM4EUFBU42YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD4QCAWQ#issuecomment-522199130,
or mute the thread
https://github.com/notifications/unsubscribe-auth/AADLM6LM46XMKEN2XMRDZPLQE5TMTANCNFSM4EUFBU4Q
.

Maybe this is too radical, but you could consider turning off .gz offerings and forcing .xz only from the point you start offering full-icu onward, that would save on storage and a little release build time, while encouraging users toward a bandwidth saving option too.

Discussed briefly on the TSC call today. The arguments for or against enabling full-icu by default have not really changed much. If we ignore the binary download size issue for just a minute, one of the more significant issues is the amount of reserved runtime memory required by memory mapping the full ICU data set. One commpromise approach we could take would be:

  1. Continue building the node.js binary with the small-icu subset statically linked
  2. Bundle the full-icu data with the node.js distribution such that it is always available but not used by default.
  3. Enable the use of full-icu with a command line switch.

This approach would be nearly identical to the existing approach using the full-icu npm distribution but would bundle the full-icu data set rather than having it installed manually. The node.js binary would also Just Knowā„¢ļø where to look for the data set rather than the user having to provide the full path.

Longer term, what would really help advance this along would be making changes internally to ICU that allow incremental on-demand loading of ICU data subsets. It's been talked about for quite some time but has not yet been implemented. Maybe (ahem, google) some company (ahem, microsoft) could dedicate (ahem, ibm) some resources to making those changes long term?

@jasnell

the amount of reserved runtime memory required by memory mapping the full ICU data set

This does affect the address space, but with 'full ICU by default' the full ICU data is just a single symbol (pure text) managed by the DLL loaderā€¦Ā unless your address space is 30M from being full, what is the impact? There should be no change to the actual active memory area unless you actually use the locales.

Bundle the full-icu data with the node.js distribution such that it is always available but not used by default

I'm not clear on the benefit here. What does this impact?

The node.js binary would also Just Knowā„¢

9617

incremental on-demand loading of ICU data subsets

loading from disk or from network/etc?

Just to repeat from discussion on IRC:

  • if you build ICU with full-icu, there's no mmap() involved. The data is just a large symbol loaded from the read-only pure text segment.

if you build ICU with full-icu, there's no mmap() involved. The data is just a large symbol loaded from the read-only pure text segment.

Ah, right, thank you for the reminder on that. It's been a minute since I've thought about ICU related things.

The concern on the memory is specific to container and serverless environments. @MylesBorins could likely expand on that a bit more.

@jasnell i wrote to @MylesBorins on IRCā€¦Ā it should only impact disk/network download size (and repo size). Should not impact to memory. But perhaps someone in those environments could help test?

@jasnell

Enable the use of full-icu with a command line switch.

Perhaps an opt-out for systems where e.g. rss increase could be a significant concern (or where there are others) would be a better option than an opt-in for everyone? Or perhaps we could introduce that in a staged way with a runtime icu selection flag and just update the default value in a minor release as we go?

so testing the docker image https://github.com/srl295/docker-node/commit/e99930011d698295c172f6919d93f4251081ad9e

$ docker images
REPOSITORY                         TAG                   ā€¦             SIZE
node                               12-alpine-small-icu   ā€¦        80.3 MB
node                               12-alpine-full-icu    ā€¦        105 MB
$ docker run -it --rm node:12-alpine-small-icu -p 'fs.statSync("/usr/local/bin/node").size'
45633344
$ docker run -it --rm node:12-alpine-full-icu -p 'fs.statSync("/usr/local/bin/node").size'
70282992

@ChALkeR

Perhaps an opt-out for systems where e.g. rss increase could be a significant concern

two options:

  1. configure ā€¦Ā --with-intl=none or ā€¦ --with-intl=small-icu - i don't think we should remove these options.
  2. still providing a download option built with small-icu?

@srl295, no, I mean a run-time command line flag similar to --with-intl configure option. Would that be viable?

@ChALkeR here's (I think ) my measurement of RSS use:

$ docker run --rm -it node:12-alpine-full-icu -p 'process.memoryUsage().rss / 1024 / 1024'
13.7265625
$ docker run --rm -it node:12-alpine-small-icu -p 'process.memoryUsage().rss / 1024 / 1024'
13.6796875

not sure yet if these are statistically significant.

@ChALkeR you could probably call u_setCommonData(p) at runtime:

  • where p is 'stubdata'. that will most likely just crash/fail
  • where p is the small icu data. Then you'd have to have both the small and full data on disk.

is the goal to disable Intl entirely? Why would you not just download a different node binary?

@srl295 Thanks!

No specific reason yet, just trying to think of possible ways forward in case if full-icu poses a problem for someone.

I am +1 to just shipping with full-icu by default at this point.

@ChALkeR i think just the current rebuilding with smaller icu at build time is the best opt-out option.

Why would you not just download a different node binary?

@srl295 building/providing more than 1 binary would be a significant additional overhead on the build/release side. We talked about the command line option to avoid having to have more than 1 binary.

If what you show above is the RSS difference with/without full ICU I dont' see that it should be an issue as they are very close to each other. Based on what you've said this is expected the only reason it was being discussed is because @MylesBorins raised an issue over mmap, but it seems like that does not apply from what you have said.

building/providing more than 1 binary would be a significant additional overhead on the build/release side

right.

OK, this isn't ready for PRime time, but i started a wip at https://github.com/srl295/node/commits/full-icu-default

  • full-icu is the default ( :tada: )
  • gzipped datafile (10M compressed, vs 3M old english-only file) gets temporarily uncompressed to deps/icu-tmp/icudt64l.dat
  • i think i left in the ability to build small-icu still (though it may require an (unnecessary) re-download of icu

maybe someone likes to make a customized node source tree that still has small icuā€¦Ā also, while this is 'new', it might be helpful to be able to compare both side by side.

Hopefully full-ICU will work with https://v8.dev/features/intl-numberformat.

Current version of Node.js does support locale argument, but only supports English. This is tricky to check.

How about this idea:
For developers who expect to get the same browser-like results on Node.js(I am one of them);
shows a warning message while calling methods with locale argument which is not supported by current build.

Well, full feature support is better.

Current version of Node.js does support locale argument, but only supports English. This is tricky to check.

It's better than that. It's not just English. Please see https://nodejs.org/api/intl.html#intl_providing_icu_data_at_runtime

There's no API that tells you what the exact situation is. However, there are internal parameters that will indicate that, as well as results returned by https://www.npmjs.com/package/full-icu ( the package ) or https://github.com/srl295/btest402

How about this idea:
For developers who expect to get the same browser-like results on Node.js(I am one of them);
shows a warning message while calling methods with locale argument which is not supported by current build.

There will always (well, at least for a long long time) be content not supported by the current build. The Torwali language isn't yet supported for example, so trw won't be supported. So I don't think a warning is needed.

I think the best is to make it easier to get 'full icu' to more people. And so I'm working on a PR for this issue.

@srl295 thanks and looking forward to the PR.

icudt64l.dat compression:

| type | Size | % |
|-----|----------|---|
| none | 27,531,792 | 0% |
| compress | 16,359,873 | 68% |
| gzip | 11,007,314 | 60% |
| bzip2 | 9,781,482 | 64% |
| xz | 6,669,432 | 76% |

Looks like bz2 is even in python 2.7 so may be worth while. (1.2Mb)

I tested v13.0.1 on Windows and found that Date.toLocaleString (and related) now respect their parameters and produce expected results (they do not on v12), so I created a PR on the repo for MDN's compatibility data.

https://github.com/mdn/browser-compat-data/pull/5068

The MDN compatibility table for these functions currently show "?".

Was this page helpful?
0 / 5 - 0 ratings

Related issues

jmichae3 picture jmichae3  Ā·  3Comments

willnwhite picture willnwhite  Ā·  3Comments

mcollina picture mcollina  Ā·  3Comments

danielstaleiny picture danielstaleiny  Ā·  3Comments

seishun picture seishun  Ā·  3Comments