Git-lfs: "git lfs pull origin" hangs indefinitely after upgrade to 2.6.0

Created on 2 Nov 2018 · 29 comments · Source: git-lfs/git-lfs

$ GIT_TRACE=1 git lfs pull origin 
13:37:24.585686 git.c:654               trace: exec: git-lfs pull origin
13:37:24.585782 run-command.c:637       trace: run_command: git-lfs pull origin
13:37:24.598218 trace git-lfs: exec: git 'version'
13:37:24.610754 trace git-lfs: exec: git 'config' '-l'
13:37:24.615020 trace git-lfs: exec: git '-c' 'filter.lfs.smudge=' '-c' 'filter.lfs.clean=' '-c' 'filter.lfs.process=' '-c' 'filter.lfs.required=false' 'rev-parse' 'HEAD' '--symbolic-full-name' 'HEAD'
13:37:24.619636 trace git-lfs: exec: git '-c' 'filter.lfs.smudge=' '-c' 'filter.lfs.clean=' '-c' 'filter.lfs.process=' '-c' 'filter.lfs.required=false' 'rev-parse' 'HEAD' '--symbolic-full-name' 'HEAD'
13:37:24.628521 trace git-lfs: tq: running as batched queue, batch size of 100
13:37:26.553478 trace git-lfs: filepathfilter: rejecting "AAA/.derived" via []
13:37:26.553571 trace git-lfs: filepathfilter: accepting "AAA/.derived"
13:37:26.553641 trace git-lfs: filepathfilter: rejecting "AAA/.project" via []
...

After this point, the process is stuck indefinitely.

The current workaround is to downgrade to git-lfs 2.5.0.

Versions:

$ git lfs env
git-lfs/2.6.0 (GitHub; linux amd64; go 1.11.1; git 93702d75)
git version 2.18.0

Endpoint=https://XXXX/info/lfs (auth=basic)
LocalWorkingDir=/home/jenkins/workspace/workspace/AAAA
LocalGitDir=/home/jenkins/workspace/workspace/AAAA/.git
LocalGitStorageDir=/home/jenkins/workspace/workspace/AAAA/.git
LocalMediaDir=/home/jenkins/workspace/workspace/AAAA/.git/lfs/objects
LocalReferenceDirs=
TempDir=/home/jenkins/workspace/workspace/AAAA/.git/lfs/tmp
ConcurrentTransfers=3
TusTransfers=false
BasicTransfersOnly=false
SkipDownloadErrors=false
FetchRecentAlways=false
FetchRecentRefsDays=7
FetchRecentCommitsDays=0
FetchRecentRefsIncludeRemotes=true
PruneOffsetDays=3
PruneVerifyRemoteAlways=false
PruneRemoteName=origin
LfsStorageDir=/home/jenkins/workspace/workspace/AAAA/.git/lfs
AccessDownload=basic
AccessUpload=basic
DownloadTransfers=basic
UploadTransfers=basic
GIT_EXEC_PATH=/usr/libexec/git-core
git config filter.lfs.process = "git-lfs filter-process"
git config filter.lfs.smudge = "git-lfs smudge -- %f"
git config filter.lfs.clean = "git-lfs clean -- %f"

Most helpful comment

Okay, I've opened git-lfs/gitobj#6, which should fix this issue. I can reproduce the issue now, both with git lfs fetch and git lfs prune, interestingly enough. I will try to get the various pieces into gitobj merged, tag a v1.1.0 in that repository, and merge the two pieces in this repository (including a new vendored copy of gitobj). We should then be able to get this into a v2.6.1 fairly quickly.

Thanks, both of you, for your work in helping me track down and reproduce this issue.

All 29 comments

Hey, sorry to hear you're having trouble with the new version. I'd definitely like to see if we can fix this.

I'm not seeing this myself, so I have some questions to see if we can pin down what's going on.

  • Is the last message you're seeing one of the filepathfilter lines?
  • If not, could you include the final few (at least ten) lines of the GIT_TRACE output?
  • Does the output change if you also use GIT_TRANSFER_TRACE=1 (in addition to GIT_TRACE=1)?
  • Is there a progress meter when this hang occurs?

Hey, sorry to hear you're having trouble with the new version. I'd definitely like to see if we can fix this.

Hello, no worries, we have a working workaround in place (the downgrade), so there's no rush :)

I'm not seeing this myself, so I have some questions to see if we can pin down what's going on.

  • Is the last message you're seeing one of the filepathfilter lines?

yes

  • If not, could you include the final few (at least ten) lines of the GIT_TRACE output?

The last lines are something like:

15:57:02.402049 trace git-lfs: filepathfilter: rejecting "file.java" via []
15:57:02.402154 trace git-lfs: filepathfilter: accepting "file.java"
  • Does the output change if you also use GIT_TRANSFER_TRACE=1 (in addition to GIT_TRACE=1)?

No, it doesn't change. I can see that the process gets stuck after 103 lines like trace git-lfs: filepathfilter: accepting; the batch size is 100.

  • Is there a progress meter when this hang occurs?

No, I cannot see it in version 2.6.0, whereas in version 2.5.0 I can.

I have a conjecture about what's going on here. I have a branch in bk2204/git-lfs called debug-lfs-hang, which you can build and try with GIT_TRACE=1. That should provide more debugging information for us to see if it's making progress, just very slowly, or if it's hanging.
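
If you'd prefer to build it yourself, the steps are roughly the following (a sketch only; it assumes a working Go toolchain and that make places the binary in bin/ as it does in this repository):

$ git clone https://github.com/bk2204/git-lfs.git && cd git-lfs
$ git checkout debug-lfs-hang
$ make                                          # should produce bin/git-lfs
$ cd /path/to/affected/repo
$ GIT_TRACE=1 /path/to/git-lfs/bin/git-lfs pull origin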

I've also attached a suitable binary for Linux amd64 here, if you don't want to build it yourself. The SHA-256 hash when uncompressed is 304e7cf1dd7b03e39bbe6fe95e3ccd53c9b86b15533db82d16954ce6454687a7.

Hello @bk2204, I've just tried it (sorry, I don't have access to this system during the weekend). Using the binary you provided (commit fe2718f2):

$ git lfs env
git-lfs/2.6.0 (GitHub; linux amd64; go 1.11.1; git fe2718f2)
git version 2.18.0
$ GIT_TRACE=1 git-lfs pull origin
09:55:05.690216 trace git-lfs: exec: git 'version'
09:55:05.708718 trace git-lfs: exec: git 'config' '-l'
09:55:05.714694 trace git-lfs: exec: git '-c' 'filter.lfs.smudge=' '-c' 'filter.lfs.clean=' '-c' 'filter.lfs.process=' '-c' 'filter.lfs.required=false' 'rev-parse' 'HEAD' '--symbolic-full-name' 'HEAD'
09:55:05.721050 trace git-lfs: exec: git '-c' 'filter.lfs.smudge=' '-c' 'filter.lfs.clean=' '-c' 'filter.lfs.process=' '-c' 'filter.lfs.required=false' 'rev-parse' 'HEAD' '--symbolic-full-name' 'HEAD'
09:55:05.730892 trace git-lfs: tq: running as batched queue, batch size of 100
09:55:09.752781 trace git-lfs: filepathfilter: rejecting "file.java" via []
09:55:09.752927 trace git-lfs: filepathfilter: accepting "file.java"
...

It still hangs after the first batch, and I don't see any additional log printed. Perhaps I'm doing something wrong with your binary?

No, that's actually very helpful. That helps me know that my conjecture was likely incorrect and that there's something else going on instead. I appreciate your assistance in testing.

Would running it with strace -ftt be useful?

Let me try anyway:

$ strace -ftt git lfs pull origin
...
[pid  1418] 16:57:53.520018 brk(NULL)   = 0x1776000
[pid  1418] 16:57:53.520188 brk(0x179b000) = 0x179b000
[pid  1418] 16:57:53.520427 brk(NULL)   = 0x179b000
[pid  1418] 16:57:53.520573 brk(NULL)   = 0x179b000
[pid  1418] 16:57:53.520729 brk(0x1793000) = 0x1793000
[pid  1418] 16:57:53.520950 brk(NULL)   = 0x1793000
[pid  1418] 16:57:53.521125 brk(NULL)   = 0x1793000
[pid  1418] 16:57:53.521281 brk(NULL)   = 0x1793000
[pid  1418] 16:57:53.521579 brk(0x1792000) = 0x1792000
[pid  1418] 16:57:53.521851 brk(NULL)   = 0x1792000
[pid  1418] 16:57:53.523496 write(1, "imizer.default.properties\000100644"..., 4096) = 4096
[pid  1406] 16:57:53.523685 <... epoll_pwait resumed> [{EPOLLIN, {u32=1566502240, u64=139652428131680}}], 128, -1, NULL, 0) = 1
[pid  1406] 16:57:53.523852 epoll_pwait(4,  <unfinished ...>
[pid  1418] 16:57:53.525011 write(1, "ob 879f967daf154995c8e9f0ede4093"..., 4096 <unfinished ...>

The write is stuck for about 60 seconds, then I see a never-ending loop of nanosleep/futex calls:

[pid  1403] 16:58:52.784207 nanosleep({0, 20000}, NULL) = 0 
[pid  1403] 16:58:52.784630 futex(0xce2530, FUTEX_WAIT_PRIVATE, 0, {60, 0}) = -1 ETIMEDOUT (Connection timed out)
[pid  1403] 16:59:52.785107 nanosleep({0, 20000}, NULL) = 0
[pid  1403] 16:59:52.785528 futex(0xce2530, FUTEX_WAIT_PRIVATE, 0, {60, 0}) = -1 ETIMEDOUT (Connection timed out)
...
[pid  2331] 17:33:04.472070 futex(0xce2530, FUTEX_WAIT_PRIVATE, 0, {60, 0}) = -1 ETIMEDOUT (Connection timed out)
[pid  2331] 17:34:04.472561 nanosleep({0, 20000}, NULL) = 0
[pid  2331] 17:34:04.472903 futex(0xce2530, FUTEX_WAIT_PRIVATE, 0, {60, 0}) = -1 ETIMEDOUT (Connection timed out)
[pid  2331] 17:35:04.473506 nanosleep({0, 20000}, NULL) = 0
[pid  2331] 17:35:04.473956 futex(0xce2530, FUTEX_WAIT_PRIVATE, 0, {60, 0}
...

I think the process you're tracing here with PID 1418 is actually a git cat-file process, which makes sense. Could you attach the full strace here so that I can see what the other process is doing when that error occurs? I want to know if the error is related to buffering (in which case the new commit on my debug-lfs-hang branch might work) or if the error is related to the new object scanner.

If you're not comfortable posting it here, you can email it to my username at github.com.

I've built the binary (git-lfs/2.6.0 (GitHub; linux amd64; go 1.11.2; git 6f588663)) and tried it on the same machine as yesterday.

Unfortunately, it shows the same behavior as the previous commit.

Hopefully this is enough; if not, I can send everything via email:

$ pstree -lap 13930
strace,13930 -ftt /usr/bin/git-lfs pull
  └─git-lfs,13932 pull
      ├─git,13947 -c filter.lfs.smudge= -c filter.lfs.clean= -c filter.lfs.process= -c filter.lfs.required=false ls-tree -r -l -z --full-tree d6f403e6c7107b39a15e53f970a83ab6ab01765d
      ├─{git-lfs},13933
      ├─{git-lfs},13934
      ├─{git-lfs},13935
      ├─{git-lfs},13936
      ├─{git-lfs},13937
      ├─{git-lfs},13938
      ├─{git-lfs},13940
      └─{git-lfs},13942

strace output:

$ strace -ftt /usr/bin/git-lfs pull
...
[pid 13940] 10:01:44.357986 epoll_pwait(4,  <unfinished ...>
[pid 13947] 10:01:44.359523 write(1, "    1653\tEaiServer/src/test/java"..., 4096) = 4096
[pid 13940] 10:01:44.359777 <... epoll_pwait resumed> [{EPOLLIN, {u32=2133712016, u64=139717419850896}}], 128, -1, NULL, 0) = 1
[pid 13940] 10:01:44.359987 epoll_pwait(4,  <unfinished ...>
[pid 13947] 10:01:44.403664 write(1, "om/fis/eai/job/module/ASFCustomA"..., 4096) = 4096
[pid 13940] 10:01:44.403904 <... epoll_pwait resumed> [{EPOLLIN, {u32=2133712016, u64=139717419850896}}], 128, -1, NULL, 0) = 1
[pid 13940] 10:01:44.404107 epoll_pwait(4,  <unfinished ...>
[pid 13947] 10:01:44.431368 write(1, "c3b6a7ded9657af16fc8a9537089    "..., 4096) = 4096
[pid 13940] 10:01:44.431649 <... epoll_pwait resumed> [{EPOLLIN, {u32=2133712016, u64=139717419850896}}], 128, -1, NULL, 0) = 1
[pid 13940] 10:01:44.431895 epoll_pwait(4,  <unfinished ...>
[pid 13947] 10:01:44.432018 brk(NULL)   = 0x2086000
[pid 13947] 10:01:44.432274 brk(0x20a7000) = 0x20a7000
[pid 13947] 10:01:44.432498 brk(NULL)   = 0x20a7000
[pid 13947] 10:01:44.433012 brk(NULL)   = 0x20a7000
[pid 13947] 10:01:44.433232 brk(0x209f000) = 0x209f000
[pid 13947] 10:01:44.433445 brk(NULL)   = 0x209f000
[pid 13947] 10:01:44.433672 brk(NULL)   = 0x209f000
[pid 13947] 10:01:44.433797 brk(NULL)   = 0x209f000
[pid 13947] 10:01:44.433934 brk(0x209d000) = 0x209d000
[pid 13947] 10:01:44.434100 brk(NULL)   = 0x209d000
[pid 13947] 10:01:44.460291 write(1, "ateKeeperTimeValidTest.java\0001006"..., 4096) = 4096
[pid 13940] 10:01:44.460609 <... epoll_pwait resumed> [{EPOLLIN, {u32=2133712016, u64=139717419850896}}], 128, -1, NULL, 0) = 1
[pid 13940] 10:01:44.460827 epoll_pwait(4,  <unfinished ...>
[pid 13947] 10:01:45.017619 write(1, "\tFinace/build/db/sqlserver/Recre"..., 4096) = 4096
[pid 13940] 10:01:45.017947 <... epoll_pwait resumed> [{EPOLLIN, {u32=2133712016, u64=139717419850896}}], 128, -1, NULL, 0) = 1
[pid 13940] 10:01:45.018195 epoll_pwait(4,  <unfinished ...>
[pid 13947] 10:01:45.394107 write(1, "nace/config/images/toc_closed.gi"..., 4096) = 4096
[pid 13940] 10:01:45.394397 <... epoll_pwait resumed> [{EPOLLIN, {u32=2133712016, u64=139717419850896}}], 128, -1, NULL, 0) = 1
[pid 13940] 10:01:45.394581 epoll_pwait(4,  <unfinished ...>
[pid 13947] 10:01:45.395498 write(1, "late.xls\000100644 blob 5de2c045966"..., 4096) = 4096
[pid 13940] 10:01:45.395708 <... epoll_pwait resumed> [{EPOLLIN, {u32=2133712016, u64=139717419850896}}], 128, -1, NULL, 0) = 1
[pid 13940] 10:01:45.395888 epoll_pwait(4,  <unfinished ...>
[pid 13947] 10:01:45.396624 write(1, " 7750f5e6a4dca092a2410ae8cdd0dff"..., 4096) = 4096
[pid 13940] 10:01:45.396948 <... epoll_pwait resumed> [{EPOLLIN, {u32=2133712016, u64=139717419850896}}], 128, -1, NULL, 0) = 1
[pid 13940] 10:01:45.397072 epoll_pwait(4,  <unfinished ...>
[pid 13947] 10:01:45.397129 brk(NULL)   = 0x209d000
[pid 13947] 10:01:45.397381 brk(0x20c2000) = 0x20c2000
[pid 13947] 10:01:45.397617 brk(NULL)   = 0x20c2000
[pid 13947] 10:01:45.397858 brk(NULL)   = 0x20c2000
[pid 13947] 10:01:45.398087 brk(0x20ba000) = 0x20ba000
[pid 13947] 10:01:45.398324 brk(NULL)   = 0x20ba000
[pid 13947] 10:01:45.398618 brk(NULL)   = 0x20ba000
[pid 13947] 10:01:45.398876 brk(NULL)   = 0x20ba000
[pid 13947] 10:01:45.399094 brk(0x20b8000) = 0x20b8000
[pid 13947] 10:01:45.399317 brk(NULL)   = 0x20b8000
[pid 13947] 10:01:46.488126 write(1, "sReportTemplates/UpcomingDividen"..., 4096) = 4096
[pid 13940] 10:01:46.488476 <... epoll_pwait resumed> [{EPOLLIN, {u32=2133712016, u64=139717419850896}}], 128, -1, NULL, 0) = 1
[pid 13940] 10:01:46.488698 epoll_pwait(4,  <unfinished ...>
[pid 13947] 10:01:46.489462 write(1, "n/AC_OCBC-105_000_core_finace_dd"..., 4096) = 4096
[pid 13940] 10:01:46.489701 <... epoll_pwait resumed> [{EPOLLIN, {u32=2133712016, u64=139717419850896}}], 128, -1, NULL, 0) = 1
[pid 13940] 10:01:46.489851 epoll_pwait(4,  <unfinished ...>
[pid 13947] 10:01:46.490877 write(1, "tion/F_PB-38402_000_core_finace_"..., 4096) = 4096
[pid 13940] 10:01:46.491116 <... epoll_pwait resumed> [{EPOLLIN, {u32=2133712016, u64=139717419850896}}], 128, -1, NULL, 0) = 1
[pid 13940] 10:01:46.491287 epoll_pwait(4,  <unfinished ...>
[pid 13947] 10:01:46.492128 write(1, "re_finace_ddl.xml\000100644 blob 3f"..., 4096) = 4096
[pid 13940] 10:01:46.492381 <... epoll_pwait resumed> [{EPOLLIN, {u32=2133712016, u64=139717419850896}}], 128, -1, NULL, 0) = 1
[pid 13940] 10:01:46.492526 epoll_pwait(4,  <unfinished ...>
[pid 13947] 10:01:46.493408 write(1, "100644 blob 60669d9e49a7f4be4e6b"..., 4096) = 4096
[pid 13940] 10:01:46.493645 <... epoll_pwait resumed> [{EPOLLIN, {u32=2133712016, u64=139717419850896}}], 128, -1, NULL, 0) = 1
[pid 13940] 10:01:46.493792 epoll_pwait(4,  <unfinished ...>
[pid 13947] 10:01:46.583296 write(1, "   723\tFinace/db_migration/F_SRE"..., 4096) = 4096
[pid 13940] 10:01:46.583581 <... epoll_pwait resumed> [{EPOLLIN, {u32=2133712016, u64=139717419850896}}], 128, -1, NULL, 0) = 1
[pid 13940] 10:01:46.583747 epoll_pwait(4,  <unfinished ...>
[pid 13947] 10:01:46.584057 brk(NULL)   = 0x20b8000
[pid 13947] 10:01:46.584217 brk(0x20de000) = 0x20de000
[pid 13947] 10:01:46.584421 brk(NULL)   = 0x20de000
[pid 13947] 10:01:46.584599 brk(NULL)   = 0x20de000
[pid 13947] 10:01:46.584860 brk(0x20d6000) = 0x20d6000
[pid 13947] 10:01:46.585079 brk(NULL)   = 0x20d6000
[pid 13947] 10:01:46.585278 brk(NULL)   = 0x20d6000
[pid 13947] 10:01:46.585476 brk(NULL)   = 0x20d6000
[pid 13947] 10:01:46.585645 brk(0x20d4000) = 0x20d4000
[pid 13947] 10:01:46.585810 brk(NULL)   = 0x20d4000
[pid 13947] 10:01:46.586702 write(1, "imizer.default.properties\000100644"..., 4096) = 4096
[pid 13940] 10:01:46.586875 <... epoll_pwait resumed> [{EPOLLIN, {u32=2133712016, u64=139717419850896}}], 128, -1, NULL, 0) = 1
[pid 13940] 10:01:46.587322 epoll_pwait(4,  <unfinished ...>
[pid 13947] 10:01:46.587534 write(1, "ob 879f967daf154995c8e9f0ede4093"..., 4096 <unfinished ...>

The last write is stuck; then we get the usual futex/nanosleep loop:

[pid 13933] 10:02:44.358005 <... futex resumed> ) = -1 ETIMEDOUT (Connection timed out)
[pid 13933] 10:02:44.358186 nanosleep({0, 20000}, NULL) = 0
[pid 13933] 10:02:44.358476 futex(0xce2530, FUTEX_WAIT_PRIVATE, 0, {60, 0}

Perhaps it is useful to note that if the process git -c filter.lfs.smudge= -c filter.lfs.clean= -c filter.lfs.process= -c filter.lfs.required=false ls-tree -r -l -z --full-tree d6f403e6c7107b39a15e53f970a83ab6ab01765d is invoked separately, it completes without any problem.
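
For completeness, the standalone check was just something like the following (output discarded, so only the exit status matters):

$ git -c filter.lfs.smudge= -c filter.lfs.clean= -c filter.lfs.process= -c filter.lfs.required=false ls-tree -r -l -z --full-tree d6f403e6c7107b39a15e53f970a83ab6ab01765d > /dev/null
$ echo $?
0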

@bk2204 I tried to rule out parsing errors of git ls-tree in https://github.com/dfa1/git-lfs/commit/6441a192639034c133e8d3e6c32875def4e2e1b4 (beware: tests are not passing) by simply using a line-by-line approach.

I tested on the same machine as before with no luck: git-lfs is still hanging indefinitely. Please let me know if you need additional information or tests.

Is it possible for you to bisect this? I'd really like to know what's causing the issue, and the easiest way to find out would be to have you bisect it (since I presume I don't have access to your repo). If you have questions about this process, I'd be happy to walk you through it.
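
For reference, a bisection between the two releases could look roughly like this (a sketch; the test script is a placeholder you'd write to exercise the hanging pull, and timeout turns a hang into a failing exit code):

$ git clone https://github.com/git-lfs/git-lfs.git && cd git-lfs
$ git bisect start v2.6.0 v2.5.0     # first known-bad, then last known-good
$ git bisect run sh -c 'make || exit 125; timeout 300 /path/to/test-pull.sh ./bin/git-lfs'
$ git bisect log                     # review the steps; bisect run prints the first bad commit when it finishes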

I think what's happening is that the reading process (the Git LFS process) is getting stuck, causing the git ls-tree process to hang, but I'm not sure why. If a bisection isn't possible, I have a new debug-lfs-hang branch which reverts a commit I think might be the cause that you could try.

I appreciate you patiently working through this with me.

Thanks for your work on this issue - we're experiencing it too.

I've logged #3366 with the result of my bisect (I didn't want to pollute this one in case the issue is not the same underlying failure, although it looks the same).

For us the issue seems to be repository- (i.e. data-) dependent: of about 10 active LFS repositories on one server, only a single one shows the errors. Is that consistent with what you have observed?

I think I can reproduce this issue now. If I unpack all of the objects in my test repository and run (ulimit -n 32 && git lfs checkout), I'm able to reproduce a hang, but not if I revert the commit you bisected to in #3366. I'm going to go on the assumption that these are the same issue and then I'll see if my patch for it addresses the issue for you.
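
For anyone following along, that reproduction amounts to roughly this (a sketch; the pack files are moved aside first because git unpack-objects skips objects that already exist in the repository):

$ cd /path/to/test-repo
$ mkdir /tmp/packs && mv .git/objects/pack/* /tmp/packs/
$ for p in /tmp/packs/*.pack; do git unpack-objects < "$p"; done   # every object becomes a loose object
$ (ulimit -n 32 && git lfs checkout)                               # the low fd limit triggers the hang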

I have a new version of my debug-lfs-hang branch (on my fork) that should address this. @dfa1, @whitty, could you test this branch and see if it fixes the issue for you? Note that it pulls in a new version of the gitobj module, which is required for the fix to be effective.

Thanks! I'm having trouble getting access to the gitobj change (I'm a 100% Go noob here, so sorry).

Building in the same workspace as I did the bisect I get:

go generate github.com/git-lfs/git-lfs/commands
go: finding github.com/git-lfs/gitobj v1.0.0-20181107174809-4be4f67f60bbb94f22c55c1b9e526dfd11e6e07e
go: github.com/git-lfs/gitobj@v1.0.0-20181107174809-4be4f67f60bbb94f22c55c1b9e526dfd11e6e07e: unknown revision 4be4f67f60bbb94f22c55c1b9e526dfd11e6e07e

I'm assuming the revision is a git reference, but 4be4f67f60bbb94f22c55c1b9e526dfd11e6e07e doesn't turn up in either the bk2204/gitobj or git-lfs/gitobj repositories - is it pushed? Do I need to convince go to look elsewhere?

Previously I'd had to hack a symlink from <gopath>/src/github.com/git-lfs/git-lfs to my checkout to get 2.5.x to build - I don't understand what that did, but perhaps something similar is required to find gitobj?

hello @bk2204, thanks for your change!

I was able to build this new version:

$ /tmp/git-lfs-linux-amd64 env
git-lfs/2.6.0 (GitHub; linux amd64; go 1.11.2; git 93ac8a48)
git version 2.18.0
...

but unfortunately it still hangs after 103 files.

Perhaps it is useful to say that this happens regardless of ulimit and batch size.

After further investigation I ended up adding additional trace output in https://github.com/dfa1/git-lfs/commit/a13daae9f7ed40fdedc9aa47b8cb617394ff18fe:

$ GIT_TRACE=1 /tmp/git-lfs-linux-amd64 fetch
14:13:50.282868 trace git-lfs: exec: git 'version'
14:13:50.291878 trace git-lfs: exec: git '-c' 'filter.lfs.smudge=' '-c' 'filter.lfs.clean=' '-c' 'filter.lfs.process=' '-c' 'filter.lfs.required=false' 'rev-parse' 'HEAD' '--symbolic-full-name' 'HEAD'
14:13:50.295596 trace git-lfs: exec: git 'config' '-l'
fetch: Fetching reference HEAD
14:13:50.713262 trace git-lfs: filepathfilter: accepting "AAA"
14:13:50.713319 trace git-lfs: sending 'AAA'
14:13:50.713343 trace git-lfs:  ... ok
14:13:50.713262 trace git-lfs: filepathfilter: accepting "BBB"
14:13:50.713319 trace git-lfs: sending 'BBB'
14:13:50.713343 trace git-lfs:  ... ok
...
14:13:50.713674 trace git-lfs: ++ reading {Sha1:1de565933b05f74c75ff9a6520af5f9f8a5a2f1d Filename:AAA}, hasNext=false
... a few more "filepathfilter: accepting ..." lines, then it hangs

It looks like catFileBatchTree handles just one message; then, since hasNext is false, it stops consuming the data coming from git ls-tree.

Additionally, in git-lfs 2.5.0 I can see an extra log line that I cannot see in version 2.6.0:

$ GIT_TRACE=1  git lfs  fetch
15:30:03.807847 git.c:654               trace: exec: git-lfs fetch
15:30:03.808043 run-command.c:637       trace: run_command: git-lfs fetch
15:30:03.823379 trace git-lfs: exec: git 'version'
15:30:03.834246 trace git-lfs: exec: git '-c' 'filter.lfs.smudge=' '-c' 'filter.lfs.clean=' '-c' 'filter.lfs.process=' '-c' 'filter.lfs.required=false' 'rev-parse' 'HEAD' '--symbolic-full-name' 'HEAD'
15:30:03.840328 trace git-lfs: exec: git 'config' '-l'
fetch: Fetching reference HEAD
15:30:03.845844 trace git-lfs: run_command: git cat-file --batch

The last line is missing in version 2.6.0; is that expected?

The git cat-file --batch being missing is expected. It's been replaced with a gitobj-based scanner instead.

The git cat-file --batch being missing is expected. It's been replaced with a gitobj-based scanner instead.

I asked that because the function that I've modified is still referring to git cat-file --batch. Besides that, do you have any additional ideas or things to check?

I have a new debug-lfs-hang branch which should better handle the case where hasNext is false and avoid the deadlock. If you could try it and report back, that would be great.

However, I'm still unclear why hasNext should ever be false at all. There isn't a case in normal operation where that should occur. As far as I can tell, it means one of our lower levels is getting an end-of-file somewhere unexpected. If you have any ideas about where that's coming from, or would like to do more testing to figure out where it's coming from, I'd be very appreciative.

My concern is that in fixing the deadlock, we're papering over the root cause of hasNext being false, which could lead to some files not being fetched properly.

I'm assuming the revision is a git reference, but 4be4f67f60bbb94f22c55c1b9e526dfd11e6e07e doesn't turn up in either the bk2204/gitobj or git-lfs/gitobj repositories - is it pushed? Do I need to convince go to look elsewhere?

@whitty That's in a pull request (git-lfs/gitobj#5), and it should theoretically be available. I believe the change should be vendored in, so if you check out the repository, it should probably be in vendor/, which means that you may already have the change. You may be able to produce a tarball with git archive and try building that, which may cause the vendor directory to be used.
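
Something along these lines should do it (a sketch; the prefix and destination directory are arbitrary):

$ git archive --format=tar --prefix=git-lfs/ debug-lfs-hang | tar -x -C /tmp
$ cd /tmp/git-lfs && make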

If not, let me know and I can provide a binary.

@bk2204 using your latest commit the deadlock is gone; now the process terminates much more gracefully:

$ /tmp/git-lfs-linux-amd64 env
git-lfs/2.6.0 (GitHub; linux amd64; go 1.11.2; git 0845cc4f)
git version 2.18.0
...
$ GIT_TRACE=1 /tmp/git-lfs-linux-amd64 fetch
11:37:03.246903 trace git-lfs: exec: git 'version'
11:37:03.257953 trace git-lfs: exec: git '-c' 'filter.lfs.smudge=' '-c' 'filter.lfs.clean=' '-c' 'filter.lfs.process=' '-c' 'filter.lfs.required=false' 'rev-parse' 'HEAD' '--symbolic-full-name' 'HEAD'
11:37:03.262479 trace git-lfs: exec: git 'config' '-l'
fetch: Fetching reference HEAD
11:37:03.796515 trace git-lfs: filepathfilter: accepting "AAA"
... more lines
11:37:03.796903 trace git-lfs: filepathfilter: accepting "BBB"
Could not scan for Git LFS files
...
Errors logged to /home/jenkins/workspace/workspace/repo/.git/lfs/logs/20181109T113703.796992051.log
Use `git lfs logs last` to view the log.

The log file reports the error missing object: a0a27e7c04fe2590e367fc869f30dbe4f8b8f4b3, but git is able to resolve this hash:

$ git cat-file -t a0a27e7c04fe2590e367fc869f30dbe4f8b8f4b3
blob

EDIT: perhaps it is important to say that this blob (a0a27e7c04fe2590e367fc869f30dbe4f8b8f4b3) comes from a reference repo configured in .git/objects/info/alternates.
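
For context, the alternates mechanism is just a plain-text file listing extra object directories that git (and ideally git-lfs) should search; in this setup it looks something like this (the path below is a placeholder):

$ cat .git/objects/info/alternates
/home/jenkins/reference-repos/AAAA/.git/objects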

Using strace I just see one attempt made to resolve the blob against the local .git object store:

[pid  4247] 13:37:50 openat(AT_FDCWD, "/home/jenkins/workspace/workspace/repo/.git/objects/a0/a27e7c04fe2590e367fc869f30dbe4f8b8f4b3", O_RDONLY|O_CLOEXEC <unfinished ...>
[pid  4245] 13:37:50 <... epoll_pwait resumed> [], 128, 0, NULL, 0) = 0
[pid  4242] 13:37:50 nanosleep({0, 20000},  <unfinished ...>
[pid  4247] 13:37:50 <... openat resumed> ) = -1 ENOENT (No such file or directory)
[pid  4245] 13:37:50 futex(0xc00007a140, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid  4247] 13:37:50 futex(0xc00007a140, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid  4245] 13:37:50 <... futex resumed> ) = -1 EAGAIN (Resource temporarily unavailable)
[pid  4242] 13:37:50 <... nanosleep resumed> NULL) = 0
[pid  4247] 13:37:50 <... futex resumed> ) = 0
[pid  4245] 13:37:50 nanosleep({0, 3000},  <unfinished ...>
[pid  4242] 13:37:50 nanosleep({0, 20000},  <unfinished ...>
[pid  4247] 13:37:50 write(2, "Scanner error: missing object: a"..., 72Scanner error: missing object: a0a27e7c04fe2590e367fc869f30dbe4f8b8f4b3
 <unfinished ...>

@whitty That's in a pull request (git-lfs/gitobj#5), and it should theoretically be available. I believe the change should be vendored in, so if you check out the repository, it should probably be in vendor/, which means that you may already have the change. You may be able to produce a tarball with git archive and try building that, which may cause the vendor directory to be used.

I tried the git archive method and unpacked it to a separate directory, but that still looks for gitobj upstream (and doesn't behave nicely outside a git checkout - probably git describe is used?). Perhaps I'm missing something? I note the "build from source" notes in the wiki don't mention building outside git.

$ make
fatal: Not a git repository (or any of the parent directories): .git
fatal: Not a git repository (or any of the parent directories): .git
fatal: Not a git repository (or any of the parent directories): .git
fatal: Not a git repository (or any of the parent directories): .git
fatal: Not a git repository (or any of the parent directories): .git
fatal: Not a git repository (or any of the parent directories): .git
fatal: Not a git repository (or any of the parent directories): .git
fatal: Not a git repository (or any of the parent directories): .git
fatal: Not a git repository (or any of the parent directories): .git
fatal: Not a git repository (or any of the parent directories): .git
fatal: Not a git repository (or any of the parent directories): .git
fatal: Not a git repository (or any of the parent directories): .git
fatal: Not a git repository (or any of the parent directories): .git
go: finding github.com/git-lfs/gitobj v1.0.0-20181107174809-4be4f67f60bbb94f22c55c1b9e526dfd11e6e07e
go: github.com/git-lfs/gitobj@v1.0.0-20181107174809-4be4f67f60bbb94f22c55c1b9e526dfd11e6e07e: unknown revision 4be4f67f60bbb94f22c55c1b9e526dfd11e6e07e
go: error loading module requirements
go: finding github.com/git-lfs/gitobj v1.0.0-20181107174809-4be4f67f60bbb94f22c55c1b9e526dfd11e6e07e
go: github.com/git-lfs/gitobj@v1.0.0-20181107174809-4be4f67f60bbb94f22c55c1b9e526dfd11e6e07e: unknown revision 4be4f67f60bbb94f22c55c1b9e526dfd11e6e07e
go: error loading module requirements
git-lfs: skipping fmt, no goimports found at `goimports` ...
go generate github.com/git-lfs/git-lfs/commands
go: finding github.com/git-lfs/gitobj v1.0.0-20181107174809-4be4f67f60bbb94f22c55c1b9e526dfd11e6e07e
go: github.com/git-lfs/gitobj@v1.0.0-20181107174809-4be4f67f60bbb94f22c55c1b9e526dfd11e6e07e: unknown revision 4be4f67f60bbb94f22c55c1b9e526dfd11e6e07e
go: error loading module requirements
Makefile:160: recipe for target 'commands/mancontent_gen.go' failed

If not, let me know and I can provide a binary.

That would help - at least to verify we have a single problem.

EDIT: perhaps it is important to say this blob (a0a27e7c04fe2590e367fc869f30dbe4f8b8f4b3) is coming from a reference repo configured in .git/objects/info/alternates.

Bingo - perhaps that is why I was unable to replicate this issue standalone, only on my build machine - it is specific to the use of --reference. I'll try to get an easier reproduction for you.

I tried git repack -a to remove any external references, but that didn't seem to achieve anything.

@whitty thanks for testing. I was able to reproduce this locally by cloning a local repository using both --reference and --shared options.
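
For the record, that local reproduction was along these lines (a sketch; paths are placeholders, and the resulting clone borrows its objects via .git/objects/info/alternates instead of copying them):

$ git clone --reference /srv/repos/repo.git --shared /srv/repos/repo.git work-repo
$ cd work-repo
$ git lfs fetch    # with gitobj prior to the fix, the scan hangs or reports "missing object" here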

Thanks, that's very helpful. I don't think we handle that case in gitobj, although we should. That would explain why things are broken. I'll see if I can get that issue fixed ASAP.

Okay, I've opened git-lfs/gitobj#6, which should fix this issue. I can reproduce the issue now, both with git lfs fetch and git lfs prune, interestingly enough. I will try to get the various pieces into gitobj merged, tag a v1.1.0 in that repository, and merge the two pieces in this repository (including a new vendored copy of gitobj). We should then be able to get this into a v2.6.1 fairly quickly.

Thanks, both of you, for your work in helping me track down and reproduce this issue.

Thanks @bk2204.

I can provide some testing support on CentOS Linux (where the issue was originally discovered), Windows, and macOS.
