Runtime: Enable Git LFS and merge the corefx-testdata repo with corefx

Created on 26 Sep 2019 · 27 comments · Source: dotnet/runtime

From my conversation with @ericstj, the steps would be:

  • Make corefx-testdata adopt arcade.
  • Create a rolling build.
  • Have corefx subscribe to the output of that build.
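
The last step (subscribing corefx to the rolling build output) is typically wired up with Arcade's `darc` tool. A rough sketch, assuming `darc` is installed and authenticated; the channel name, branch, and update frequency below are illustrative, not the actual values:

```shell
# Subscribe corefx to the rolling build output of corefx-testdata.
# Channel/branch names here are placeholders for illustration.
darc add-subscription \
  --channel ".NET Core Tooling" \
  --source-repo https://github.com/dotnet/corefx-testdata \
  --target-repo https://github.com/dotnet/corefx \
  --target-branch master \
  --update-frequency everyDay
```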

cc @safern @joperezr @ViktorHofer @Anipik

area-Infrastructure-libraries


All 27 comments

IMO that's overkill. Maintaining an Arcade-enabled repository is painful. Instead, I would invest in enabling LFS in corefx.

@ViktorHofer how about we set that up in the new repo from day one? Then we leave testdata behind?

I'll put that on the agenda for the next consolidation meeting.

What was the reason corefx-testdata was created instead of using LFS? I seem to remember seeing some text about this, but I don't know where I saw it.

It predated GitHub support for LFS. IIRC, folks didn't switch to LFS due to lack of support in the internal builds, as well as the need for developers to install it locally and clone in a special way. Presumably we can approach this fresh in the context of a new repo and see whether we can cleanly solve those issues, since we're defining the entry points for clone/build.

cc @jaredpar

Here's a bit more info: https://github.com/dotnet/core-eng/issues/4105. Clearly some consideration for the mirror would need to be done and the other issues @MattGal brought up at that time.

@MattGal can we try out mirroring a sample repo with LFS enabled? E.g., the corefx-testdata repo itself would be a good candidate.

Bear in mind you have over 5,000 forks and every external contributor is going to need to install the git-lfs extension. I think an announcement period would be useful.

I looked at how GitHub handles forks for LFS, and it seems that external contributors will not need to pay for git LFS, but every fork will count against this repository's bandwidth and quota limit.

Bear in mind you have over 5,000 forks and every external contributor is going to need to install the git-lfs extension.

These days it's bundled in Git for Windows, available from Homebrew, or a bunch of other ways depending on your distro.
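
For reference, the one-time setup for a contributor is small once the extension is present. A minimal sketch (the tracked patterns below are illustrative, not the actual corefx-testdata file types):

```shell
# One-time per machine: install the LFS hooks into your Git config.
git lfs install

# In the repository, mark large binary patterns for LFS tracking.
# The patterns here are examples only.
git lfs track "*.zip" "*.bin"

# The tracking rules live in .gitattributes, which is committed like any file.
git add .gitattributes
git commit -m "Track large test assets with Git LFS"
```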

Bear in mind you have over 5,000 forks

Excellent point. This is why I was suggesting this be done at the start of the combined repository.

every fork will count against this repository's bandwidth and quota limit.

Fortunately the large test assets hardly ever change.

Bear in mind you have over 5,000 forks and every external contributor is going to need to install the git-lfs extension. I think an announcement period would be useful.

Agree. At the same time, if this step is taken, it would only be in the new consolidated repository, not in CoreFX, so everyone would be starting fresh. That being said, as we get closer to making the consolidated repository a reality, there will be an announcement describing it, the changes everyone should expect, and our minimum tooling requirements, which would include items like LFS.

every fork will count against this repository's bandwidth and quota limit.

That does appear to be the case:

Forking and pulling a repository counts against the parent repository's bandwidth limit.

The storage quota doesn't seem so bad, but the bandwidth restriction feels very limiting for us. Our repositories are cloned and built thousands of times a day just by our own CI system (keep in mind that each build in CI has X jobs, each of which does a clone that counts toward bandwidth). Once we add in forks, that seems like it could be a problem. We need to dig deeper into this before we bet on LFS as a solution.

@MattGal can we try out mirroring a sample repo with LFS enabled? E.g., the corefx-testdata repo itself would be a good candidate.

Sorry I've been away and ill. I think LFS is now supported everywhere we push repos, but the concerns of this thread are still valid. If you want help setting up an experiment though, feel free to ping me offline.

https://help.github.com/en/articles/about-storage-and-bandwidth-usage
https://help.github.com/en/articles/about-billing-for-git-large-file-storage

50GB of storage plus 50GB of bandwidth costs 5 dollars per month.

Storage:

If you push a 500 MB file to Git LFS, you'll use 500 MB of your allotted storage and none of your bandwidth

Bandwidth:

If you download a 500 MB file that's tracked with LFS, you'll use 500 MB of the repository owner's allotted bandwidth. In forks, bandwidth and storage usage count against the root of the repository network.

Our corefx-testdata repository currently has 116MB of test data. I don't have numbers on how often we fresh-clone our repository, but I wouldn't be surprised if that happens about 50 times per day, which results in 5.8GB of bandwidth per day and 174GB per month. That means we would need to buy 35 data plans, which would cost 174 dollars per month.

This is really expensive...
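
As a quick sanity check, the bandwidth arithmetic above can be reproduced directly; the clone rate and the 30-day month are the assumptions stated in the comment:

```python
# Back-of-the-envelope LFS bandwidth estimate for corefx-testdata.
lfs_payload_gb = 0.116     # ~116 MB of test data fetched per fresh clone
clones_per_day = 50        # assumed fresh-clone rate from the comment above

daily_gb = lfs_payload_gb * clones_per_day
monthly_gb = daily_gb * 30  # assuming a 30-day month

print(f"{daily_gb:.1f} GB/day, {monthly_gb:.0f} GB/month")
# 5.8 GB/day, 174 GB/month
```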

Beyond the expense...

If I want to clone the repo just for browsing, searching, and maybe building, but I never care about running tests (the case for 90% of the repos I have cloned), will I now need to pull down over a hundred extra megabytes unnecessarily? I'm fortunate to have a fast connection, but many developers do not.

@stephentoub there's an environment variable you can set to opt-out: GIT_LFS_SKIP_SMUDGE=1 https://github.com/git-lfs/git-lfs/issues/2406

The file pointers are replaced with the actual files during branch checkout, which means the smudge filter only runs for the files in the current branch. That said, you will still download the freshest LFS files; setting the env var seems to be the only way to avoid that. It would be nice to have an option for this during git clone / git checkout.
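
The opt-out mentioned above can be sketched as follows, assuming a repo that already has LFS enabled:

```shell
# Clone without downloading LFS payloads; the working tree gets the
# small pointer files instead of the real binaries.
GIT_LFS_SKIP_SMUDGE=1 git clone https://github.com/dotnet/corefx-testdata.git
cd corefx-testdata

# Later, fetch the real files only when they are actually needed
# (e.g. right before running the tests that consume them):
git lfs pull
```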

I don't have numbers of how often we fresh clone our repository but I wouldn't be surprised if that happens about 50 times per day which results in 5.8GB bandwidth per day and 174GB per month. That means we would need to buy 35 data plans which cost 174 dollars per month.

Looking at just the PR queue for CoreFX, we clone several hundred times a day. Consider just the following two days:

  • Monday: 660 clones
  • Tuesday: 440 clones

The actual number of builds created via PRs is ~40-50, but each build has 10+ jobs, each of which executes a checkout operation.

The actual number is likely much higher, because this only counts CoreFX PRs. To truly get the number for the consolidated repository, we need to consider PR + CI + official builds for core-setup, coreclr, and corefx, plus some data from GitHub about how many other clone operations we see per day.

I'd wager that CI/PR is our biggest cost here, though. Those clones are pretty much guaranteed to be fresh, or at least need to be planned for as fresh. Individual developers likely don't do a fresh clone every time, hence not as impactful here. But I'd still like to find a way to get data on that.

Edit: removed some ambiguity around job vs. build.
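
The per-day clone counts above can be cross-checked with a quick estimate; the build and job counts are the approximations given in the comment, and the 116MB payload comes from the earlier bandwidth discussion:

```python
# Estimate CI-driven clones per day for the CoreFX PR queue alone.
builds_per_day = 45   # ~40-50 builds created via PRs
jobs_per_build = 10   # each build has 10+ jobs, each doing a checkout

ci_clones_per_day = builds_per_day * jobs_per_build
print(ci_clones_per_day)  # 450, in line with the observed 440-660 clones

# If every one of those clones pulled the full LFS payload:
lfs_gb_per_day = ci_clones_per_day * 0.116  # ~116 MB per clone
print(f"{lfs_gb_per_day:.1f} GB/day")  # 52.2 GB/day from PR CI alone
```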

@jaredpar Presumably someone could look at https://github.com/dotnet/corefx/graphs/traffic and get an idea of # of clones?

@vcsjones thanks, I didn't realize that existed. That was exactly what I was looking for.

Do you know what the difference is between clone and unique clone?

Also, does that cover forks? Basically, when someone clones a fork of corefx, does it show up there? Fork cloning will count against our LFS bandwidth, hence we need to find a way to track it. It's probably peanuts compared to the main cloning, but I would want to find some data confirming that.

Do you know what the difference is between clone and unique clone?

@jaredpar "Unique cloners" (note the r) is, I believe, the count of unique entities performing clone operations.

1270 clones per day!!!! Take my calculation and multiply it by 25; that would be $4.3k per month.

Also does that cover forks?

We should find that out.

@shiftkey

"Unique cloners" (note the r) is, I believe, the count of unique entities performing clone operations.

Gotcha. So for this conversation we should be looking at clones.

1270 clones per day!!!!

Interesting, so our CI accounts for about half of all clones then. Good to know.

Take my calculation and multiply it by 25; that would be $4.3k per month.

Yep.

We decided against merging these two repositories and instead:

1) renamed corefx-testdata to https://github.com/dotnet/runtime-assets
2) added CI integration
3) arcadified the repository
4) enabled dependency flow
