Even 17 MB download is not really that big, it is really unexpected to download that much for single-header json library. Moreover once cloned it takes 90MB on the disk, even more surprising.
And the reason for that are these benchmark test files introduced by cb873a42ed258871d3e8768bf27b8ff1209c89f7. Is it really good to put such large files in the repo!? Unfortunately once you have put it then it is not easy anymore to reduce repo size. You will need to rewrite.
Anyway please consider rewriting the repo, and keeping it clean & lean, then likely everyone can use it as submodule.
Hi @nanoant,
any idea how to reorganize without fragmenting stuff around into other repos or to other sites? I understand your concern, but at least now everything is in one place. In particular, the benchmark helped to improve at least one pull request.
Cheers,
Niels
@nlohmann My concern is just these huge benchmark input files that blow the repo size, these may be just moved out from the main repo to some other repo, and then cloned on first benchmark run (in Makefile). Anything else is already pretty small.
Right now I am probably going for embedding json.hpp directly into our repo rather than using submodule for this reason.
Would a special development branch help? Then I could move most of the stuff from there and leave the master branch small. (Sorry, I have little experience with Git.)
@nlohmann I think the problem is that once the object (huge file) is there in the repo then there is no difference in which branch it resides, the clone will still fetch whole repo including huge file.
I was able to reduce repo size to 3.4 MB with recipe taken from: http://manuel.manuelles.nl/blog/2011/12/22/shrinking-your-git-repository/
But still this is lots of effort, so I probably you should skip that, and leave the repo as is.
@nlohmann
what do u think about breaking the repo into:
the library repo will be easy to be integrated into other 3rdparty codes (by git submodules, subtree ...)
to prevent radical changes to the current workflow, you can create another repo (a single header) as release repo and let people use that repo as dependency simply by git or cmake.
This split would make the development a bit inconvenient on my side. As I am doing this alone in my spare time, I would like to keep it as is.
Why would you include a single-header project as submodule?
That's how I use it. That way the link is explicit, and it tracks a particular version, rather than manually extracting the one header file from the repository with no indication that it came from another repo, and have to deal with the possibility that someone may modify the local copy. However, I have no problem with the size of this project as a submodule.
@nlohmann you can consider using git subtree split to move out the tests (can be optional submodule of main) and other big subfolders out of main repo, then you force push small master bash to main repo. This will break obviously other ppl forks, but I guess it will bring more good than harm overall.
@nanoant To me, this splitting would be inconvenient, and I like having everything in one repository.
Guys. I think the best solution is to have a package manager that will download whole repo, insert json.hpp into headers dir and remove repo
To add to the discussion I will just add my argumentation here. I want to use "json" as a submodule to my project, because:
However, the current repository is extremely large, due to the tests and benchmarks. This hinders again the distribution of my code. Therefore it would be great if a minimal repository can be maintained.
I'll also chime in here - I also like using even single-header libs as a submodule to be able to track what version I'm using and easily be able to update by just pulling the submodule. And yes, for me it is also problematic that 90mb is downloaded. It takes ages, even on my fast connection and this is gonna hit everyone using my repo as well, including autobuilders. This memory/bandwidth cost of 50+ mb just coming from a single test file is ridiculous.
@nlohmann
just tried to pack every released json.hpp into this repo:
azadkuh/nlohmann_json_release
# size in KB
$> curl https://api.github.com/repos/nlohmann/json | grep size
"size": 136956,
$> curl https://api.github.com/repos/azadkuh/nlohmann_json_release | grep size
"size": 100,
I would prefer not to maintain my own minified fork (or depend on anyone else's).
Wouldn't a submodule solution meet everyone's requirements?
git clone [email protected]:nlohmann/json.gitgit clone --recursive [email protected]:nlohmann/json.gitI believe this would make everybody happy indeed.
_I know this is already closed, but as the discussion is still going on, I want to share my thoughts_
I wonder why this is needed? I mean @nlohmann is right it is nice to have everything in one place for development and IMHO this is the what this repo is for.
If you use this as a submodule then you want to use also the tests and so on as you are developing for this lib.
And after all it is a single header library so how can you get confused which version you have?
But to have the luxury of a git-submodule I would propose a different approach.
Create a sepeate repo only for released versions and only having the very view things that are needed (Readme, license, src, CMake, ...)
This repo can be maintained by the community and will not affect any development on this repo.
coherent with #698 and #643 and #700
separate repo from actual development
Just to add to the discussion: I have created a repository at https://github.com/astoeckel/json which tracks the released json.hpp.
Note however that this repository is bare-bones and tailored towards my own needs. It just contains src/json.hpp and meson.build, as well as the licence and a minimal readme. Feel free to use the provided Python script update.py to create your own version of the tracking repository.
As of now, the size of repository has increased to whopping 435 MBs. The biggest offenders are test, benchmarks, and git history itself (.git):
isaac@host:~/temp/json# du -k -d 1
8 ./.circleci
724 ./single_include
44 ./.github
816 ./include
5908 ./doc
8 ./cmake
179368 ./.git
300 ./third_party
59388 ./benchmarks
197800 ./test
444664 .
root@host1:~/temp/json#
To reduce the size of test and benchmarks directory, we have to move their large files (mostly json/data files) into another source (e.g. a separate git repository). The questions is now how to populate and integrate them during testing/benchmarking, I can think of these two options:
Download them during the build by CMake's ExternalProject command from the external source (e.g. Git repo, HTTP server, etc).
Add them as a git submodule to the main repo.
I believe using ExternalProject is a better option since the resources will be downloaded automatically when building benchmarks/tests is triggered. The best part is that user does not have to do anything whether he/she wants to run the test/benchmarks or not.
We can later deal with the size of .git and try to rewrite the git history to remove the commits related to large files. Since these files has been mostly static, number of commit related to them is probably very low. Therefore, the other contributors shouldn't have much trouble (if any) in rebasing their clone with the revised upstream repo.
Thanks for investigating! I have little time in the moment, but this is definitely a way to move forward!
I had a PR closed by the stale bot which used ExternalData, I don't have the time to look into it right now, so if somebody wants to go back from where I left, don't hesitate.
@nlohmann: Sure! I've created a PR and the preliminary tests seems promising.
@theodelrieu: Thanks for pointing that out! I will look into it.
Since this change can affect the workflow of the users, there is the possibility that I've overlooked some use-cases. Therefore, besides passing the CI tests, hearing others' experience would important to make sure this change will not have negative effects.
@nickaein @theodelrieu @gregmarr
Hi all, I finally had time to think about this issue and thing we should move forward with it.
As a first working draft, I created a branch https://github.com/nlohmann/json/tree/tests_external which loads the tests from a separate repository via
include(FetchContent)
FetchContent_Declare(nlohmann_json_tests
GIT_REPOSITORY https://github.com/nlohmann/json_tests_tmp.git
GIT_SHALLOW TRUE
)
FetchContent_GetProperties(nlohmann_json_tests)
if(NOT nlohmann_json_tests_POPULATED)
message(STATUS "Download tests")
FetchContent_Populate(nlohmann_json_tests)
message(STATUS "Download tests - done")
add_subdirectory(${nlohmann_json_tests_SOURCE_DIR}/test)
endif()
As a result, the tests folder can be removed from the repository, leaving a much smaller footprint.
However, I am not yet sure whether this is the best approach. In particular, I am afraid that tests and code get out of sync as a PR in the original repo cannot simply add tests on its own, but would need a PR in the tests repo first. Alternatively, I could only move the test files to the separate repo, but leave the JSON files in the separate repo.
Any ideas on this?
I think splitting code and tests into a separate repositories is a bad idea. Newer git versions have "sparse checkout" support, see https://www.git-scm.com/docs/git-sparse-checkout, so it would maybe possible to not checkout the tests folder.
For general users the various package managers should also help.
@nlohmann
this repository already has a great but large history, IMHO this repository is just fine for contributors with all test stuff and ..., but the final users may only need a single header file, and any other thing means YAGNI.
it may be a better idea to have a separate repository for released versions of json.hpp, this approach is both easier for you and other contributors to work on and manage this repo and also it's easier for final users to fetch a light repository with a single header file.
BTW every solution has its own pros/cons
@nlohmann
That's great! I'm not 100% sure about this solution yet, but seems be a good compromise.
I am afraid that tests and code get out of sync as a PR in the original repo cannot simply add tests on its own, but would need a PR in the tests repo first.
I agree. There is also the problem that after a while a user may want to checkout an older version of the library while the latest json_tests repo might not be backward-compatible with that version (e.g. a test file has been deprecated and removed in some commit).
Possible solution: We might have to (internally) tie the library version to a json_tests commit/tag. In FetchContent_Declare this can be done by using GIT_TAG. This can mitigate the out of sync problems as the PR author can explicitly point to a specific json_tests commit/tag (almost always the latest tag as the time of creating PR).
I could only move the test files to the separate repo, but leave the JSON files in the separate repo.
I believe you meant to keep the source code for tests here and only move the large JSON files into the separate repo.
I totally agree with this approach. The changes in the library and test source code are related in meaningful way and there is semantic to see them in single commit (or a single PR).
On the other hand, since the large JSON data for tests doesn't have much semantics and they rarely change, we can have them as an external repo.
There is also ~58 MBs worth of data for benchmarks. While they have smaller impact (compared to ~194 MBs of test data), we can move them if you want to be conservative.
In future though, both of them can be moved to an external repo.
Putting aside the technicalities, I believe we should consider the use-cases which make the users raise this issue. In some cases the user would be happy if along the single-header release, README, LICENSE, and CMakeList.txt (probably a stripped-down version) is also provided. For instance: #1066 and #1184.
Currently, they find the contents of releases inadequate and fall back to cloning the whole repo. This can be avoided if we add such files to the zip releases. I'm not sure if there are other considerations though.
@t-b
Sparse checkout is an interesting solution! However, I wonder if it could be effective in the scenarios that people are complaining about repo size. The users can already use this solution, yet we still see complains. In addition, their hand might be tied in using this feature (e.g. in a automated environment like CI where they can only specify a URL to git repo).
And lastly, we still increase the size of .git directory in future whenever we add/modify test data.
Nevertheless, this solution can probably help some of the users. We can recommend this (as a temporary solution at least), until further notice.
@azadkuh
but the final users may only need a single header file, and any other thing means YAGNI.
There is already single header files available for such users at Releases page. Does providing them in a separate repo have any benefits?
Thanks for the comments!
Hi all, please have a look at #2081 where I implemented the approach of having a separate test data repository.
Most helpful comment
To add to the discussion I will just add my argumentation here. I want to use "json" as a submodule to my project, because:
However, the current repository is extremely large, due to the tests and benchmarks. This hinders again the distribution of my code. Therefore it would be great if a minimal repository can be maintained.