Json: 17 MB / 90 MB repo size!?

Created on 22 Jun 2015 · 29Comments · Source: nlohmann/json

Even 17 MB download is not really that big, it is really unexpected to download that much for single-header json library. Moreover once cloned it takes 90MB on the disk, even more surprising.

And the reason for that are these benchmark test files introduced by cb873a42ed258871d3e8768bf27b8ff1209c89f7. Is it really good to put such large files in the repo!? Unfortunately once you have put it then it is not easy anymore to reduce repo size. You will need to rewrite.

Anyway please consider rewriting the repo, and keeping it clean & lean, then likely everyone can use it as submodule.

question proposed fix

Source

nanoant

👍11

Most helpful comment

To add to the discussion I will just add my argumentation here. I want to use "json" as a submodule to my project, because:

I want to be transparent is ownership, to give credits where they are due.
To facilitate updating from source.

However, the current repository is extremely large, due to the tests and benchmarks. This hinders again the distribution of my code. Therefore it would be great if a minimal repository can be maintained.

tdegeus on 15 Apr 2017

👍11

All 29 comments

Hi @nanoant,

any idea how to reorganize without fragmenting stuff around into other repos or to other sites? I understand your concern, but at least now everything is in one place. In particular, the benchmark helped to improve at least one pull request.

Cheers,
Niels

nlohmann on 22 Jun 2015

@nlohmann My concern is just these huge benchmark input files that blow the repo size, these may be just moved out from the main repo to some other repo, and then cloned on first benchmark run (in Makefile). Anything else is already pretty small.

Right now I am probably going for embedding json.hpp directly into our repo rather than using submodule for this reason.

nanoant on 22 Jun 2015

👍4

Would a special development branch help? Then I could move most of the stuff from there and leave the master branch small. (Sorry, I have little experience with Git.)

nlohmann on 22 Jun 2015

@nlohmann I think the problem is that once the object (huge file) is there in the repo then there is no difference in which branch it resides, the clone will still fetch whole repo including huge file.

I was able to reduce repo size to 3.4 MB with recipe taken from: http://manuel.manuelles.nl/blog/2011/12/22/shrinking-your-git-repository/

But still this is lots of effort, so I probably you should skip that, and leave the repo as is.

nanoant on 22 Jun 2015

@nlohmann
what do u think about breaking the repo into:

library repo: only contains the single header file
utilities repo: contain all other stuff (docs, tests, ...)

the library repo will be easy to be integrated into other 3rdparty codes (by git submodules, subtree ...)

to prevent radical changes to the current workflow, you can create another repo (a single header) as release repo and let people use that repo as dependency simply by git or cmake.

azadkuh on 20 Sep 2016

👍2

This split would make the development a bit inconvenient on my side. As I am doing this alone in my spare time, I would like to keep it as is.

nlohmann on 15 Oct 2016

Why would you include a single-header project as submodule?

nlohmann on 23 Feb 2017

That's how I use it. That way the link is explicit, and it tracks a particular version, rather than manually extracting the one header file from the repository with no indication that it came from another repo, and have to deal with the possibility that someone may modify the local copy. However, I have no problem with the size of this project as a submodule.

gregmarr on 23 Feb 2017

@nlohmann you can consider using git subtree split to move out the tests (can be optional submodule of main) and other big subfolders out of main repo, then you force push small master bash to main repo. This will break obviously other ppl forks, but I guess it will bring more good than harm overall.

nanoant on 23 Feb 2017

👍2

@nanoant To me, this splitting would be inconvenient, and I like having everything in one repository.

nlohmann on 23 Feb 2017

😕4 👎2 👍1

Guys. I think the best solution is to have a package manager that will download whole repo, insert json.hpp into headers dir and remove repo

fnc12 on 23 Feb 2017

To add to the discussion I will just add my argumentation here. I want to use "json" as a submodule to my project, because:

I want to be transparent is ownership, to give credits where they are due.
To facilitate updating from source.

tdegeus on 15 Apr 2017

👍11

I'll also chime in here - I also like using even single-header libs as a submodule to be able to track what version I'm using and easily be able to update by just pulling the submodule. And yes, for me it is also problematic that 90mb is downloaded. It takes ages, even on my fast connection and this is gonna hit everyone using my repo as well, including autobuilders. This memory/bandwidth cost of 50+ mb just coming from a single test file is ridiculous.

therocode on 5 Jul 2017

👍1

@nlohmann

just tried to pack every released json.hpp into this repo:
azadkuh/nlohmann_json_release

# size in KB

$> curl https://api.github.com/repos/nlohmann/json | grep size
   "size": 136956,

$> curl https://api.github.com/repos/azadkuh/nlohmann_json_release | grep size
   "size": 100,

azadkuh on 6 Jul 2017

🎉3

I would prefer not to maintain my own minified fork (or depend on anyone else's).

Wouldn't a submodule solution meet everyone's requirements?

Move tests to their own repo
Library has tests as a Git submodule
Minimal fetch is then git clone [email protected]:nlohmann/json.git
Maxmimal fetch is then git clone --recursive [email protected]:nlohmann/json.git

njlr on 8 Sep 2017

👍3

I believe this would make everybody happy indeed.

theodelrieu on 8 Sep 2017

👍8

_I know this is already closed, but as the discussion is still going on, I want to share my thoughts_

I wonder why this is needed? I mean @nlohmann is right it is nice to have everything in one place for development and IMHO this is the what this repo is for.
If you use this as a submodule then you want to use also the tests and so on as you are developing for this lib.
And after all it is a single header library so how can you get confused which version you have?

But to have the luxury of a git-submodule I would propose a different approach.
Create a sepeate repo only for released versions and only having the very view things that are needed (Readme, license, src, CMake, ...)
This repo can be maintained by the community and will not affect any development on this repo.

PRO

seperates developemnt vs release
minifies to the minimum needed
will also work for the planned splitting of files
can also be used by package managers
no interference with the developers
coherent with #698 and #643 and #700

CON
separate repo from actual development
extra "small" work for maintaining

dan-42 on 19 Sep 2017

👎1

Just to add to the discussion: I have created a repository at https://github.com/astoeckel/json which tracks the released json.hpp.

Note however that this repository is bare-bones and tailored towards my own needs. It just contains src/json.hpp and meson.build, as well as the licence and a minimal readme. Feel free to use the provided Python script update.py to create your own version of the tracking repository.

astoeckel on 4 Feb 2018

❤3 👍2

As of now, the size of repository has increased to whopping 435 MBs. The biggest offenders are test, benchmarks, and git history itself (.git):

isaac@host:~/temp/json# du -k -d 1
8       ./.circleci
724     ./single_include
44      ./.github
816     ./include
5908    ./doc
8       ./cmake
179368  ./.git
300     ./third_party
59388   ./benchmarks
197800  ./test
444664  .
root@host1:~/temp/json#

To reduce the size of test and benchmarks directory, we have to move their large files (mostly json/data files) into another source (e.g. a separate git repository). The questions is now how to populate and integrate them during testing/benchmarking, I can think of these two options:

Download them during the build by CMake's ExternalProject command from the external source (e.g. Git repo, HTTP server, etc).
Add them as a git submodule to the main repo.

I believe using ExternalProject is a better option since the resources will be downloaded automatically when building benchmarks/tests is triggered. The best part is that user does not have to do anything whether he/she wants to run the test/benchmarks or not.

We can later deal with the size of .git and try to rewrite the git history to remove the commits related to large files. Since these files has been mostly static, number of commit related to them is probably very low. Therefore, the other contributors shouldn't have much trouble (if any) in rebasing their clone with the revised upstream repo.

nickaein on 28 Apr 2019

👍2

Thanks for investigating! I have little time in the moment, but this is definitely a way to move forward!

nlohmann on 28 Apr 2019

I had a PR closed by the stale bot which used ExternalData, I don't have the time to look into it right now, so if somebody wants to go back from where I left, don't hesitate.

https://github.com/nlohmann/json/pull/1251

theodelrieu on 28 Apr 2019

@nlohmann: Sure! I've created a PR and the preliminary tests seems promising.

@theodelrieu: Thanks for pointing that out! I will look into it.

Since this change can affect the workflow of the users, there is the possibility that I've overlooked some use-cases. Therefore, besides passing the CI tests, hearing others' experience would important to make sure this change will not have negative effects.

nickaein on 29 Apr 2019

@nickaein @theodelrieu @gregmarr

Hi all, I finally had time to think about this issue and thing we should move forward with it.

As a first working draft, I created a branch https://github.com/nlohmann/json/tree/tests_external which loads the tests from a separate repository via

    include(FetchContent)
    FetchContent_Declare(nlohmann_json_tests
        GIT_REPOSITORY https://github.com/nlohmann/json_tests_tmp.git
        GIT_SHALLOW TRUE
    )

    FetchContent_GetProperties(nlohmann_json_tests)
    if(NOT nlohmann_json_tests_POPULATED)
        message(STATUS "Download tests")
        FetchContent_Populate(nlohmann_json_tests)
        message(STATUS "Download tests - done")
        add_subdirectory(${nlohmann_json_tests_SOURCE_DIR}/test)
    endif()

As a result, the tests folder can be removed from the repository, leaving a much smaller footprint.

However, I am not yet sure whether this is the best approach. In particular, I am afraid that tests and code get out of sync as a PR in the original repo cannot simply add tests on its own, but would need a PR in the tests repo first. Alternatively, I could only move the test files to the separate repo, but leave the JSON files in the separate repo.

Any ideas on this?

nlohmann on 29 Apr 2020

I think splitting code and tests into a separate repositories is a bad idea. Newer git versions have "sparse checkout" support, see https://www.git-scm.com/docs/git-sparse-checkout, so it would maybe possible to not checkout the tests folder.

For general users the various package managers should also help.

t-b on 29 Apr 2020

@nlohmann
this repository already has a great but large history, IMHO this repository is just fine for contributors with all test stuff and ..., but the final users may only need a single header file, and any other thing means YAGNI.

it may be a better idea to have a separate repository for released versions of json.hpp, this approach is both easier for you and other contributors to work on and manage this repo and also it's easier for final users to fetch a light repository with a single header file.

BTW every solution has its own pros/cons

azadkuh on 29 Apr 2020

👍1

@nlohmann
That's great! I'm not 100% sure about this solution yet, but seems be a good compromise.

Library and tests can get out of sync

I am afraid that tests and code get out of sync as a PR in the original repo cannot simply add tests on its own, but would need a PR in the tests repo first.

I agree. There is also the problem that after a while a user may want to checkout an older version of the library while the latest json_tests repo might not be backward-compatible with that version (e.g. a test file has been deprecated and removed in some commit).

Possible solution: We might have to (internally) tie the library version to a json_tests commit/tag. In FetchContent_Declare this can be done by using GIT_TAG. This can mitigate the out of sync problems as the PR author can explicitly point to a specific json_tests commit/tag (almost always the latest tag as the time of creating PR).

I could only move the test files to the separate repo, but leave the JSON files in the separate repo.

I believe you meant to keep the source code for tests here and only move the large JSON files into the separate repo.

I totally agree with this approach. The changes in the library and test source code are related in meaningful way and there is semantic to see them in single commit (or a single PR).

On the other hand, since the large JSON data for tests doesn't have much semantics and they rarely change, we can have them as an external repo.

Which one? test or benchmark data

There is also ~58 MBs worth of data for benchmarks. While they have smaller impact (compared to ~194 MBs of test data), we can move them if you want to be conservative.

In future though, both of them can be moved to an external repo.

Why?

Putting aside the technicalities, I believe we should consider the use-cases which make the users raise this issue. In some cases the user would be happy if along the single-header release, README, LICENSE, and CMakeList.txt (probably a stripped-down version) is also provided. For instance: #1066 and #1184.

Currently, they find the contents of releases inadequate and fall back to cloning the whole repo. This can be avoided if we add such files to the zip releases. I'm not sure if there are other considerations though.

nickaein on 1 May 2020

@t-b
Sparse checkout is an interesting solution! However, I wonder if it could be effective in the scenarios that people are complaining about repo size. The users can already use this solution, yet we still see complains. In addition, their hand might be tied in using this feature (e.g. in a automated environment like CI where they can only specify a URL to git repo).
And lastly, we still increase the size of .git directory in future whenever we add/modify test data.

Nevertheless, this solution can probably help some of the users. We can recommend this (as a temporary solution at least), until further notice.

@azadkuh

but the final users may only need a single header file, and any other thing means YAGNI.

There is already single header files available for such users at Releases page. Does providing them in a separate repo have any benefits?

nickaein on 1 May 2020

Thanks for the comments!

Right now, I think it would be best to only move the test and benchmark data to a different repository and leave the actual code in the main repository. They are updated rather infrequently, and maybe one can even mark a subset of tests that can be run without external data - these tests can then still be executed without any download.
I am not sure whether the contents of the releases need to be adjusted. I think most complaints about the size came from people who just added the library as submodule and never cared about releases in the first place.

nlohmann on 1 May 2020

Hi all, please have a look at #2081 where I implemented the approach of having a separate test data repository.

nlohmann on 3 May 2020

👍1

Was this page helpful?

0 / 5 - 0 ratings