No, it seems the code is same.
128G mem & 1.8T SSD
Linux version 3.10.0-1062.9.1.el7.x86_64 ([email protected]) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-39) (GCC) ) #1 SMP Fri Dec 6 15:49:49 UTC 2019
At first the cluster work fine , after about 30 minutes, any mutation or query can not apply, it seems the whole cluster is hung.
alpha 3 log
I0205 17:05:30.041988 258429 log.go:34] Rolling up Time elapsed: 01m43s, bytes sent: 3.1 GB, speed: 30 MB/sec
I0205 17:05:31.042005 258429 log.go:34] Rolling up Time elapsed: 01m44s, bytes sent: 3.1 GB, speed: 29 MB/sec
I0205 17:05:32.041998 258429 log.go:34] Rolling up Time elapsed: 01m45s, bytes sent: 3.1 GB, speed: 29 MB/sec
I0205 17:05:33.042004 258429 log.go:34] Rolling up Time elapsed: 01m46s, bytes sent: 3.1 GB, speed: 29 MB/sec
I0205 17:05:34.041991 258429 log.go:34] Rolling up Time elapsed: 01m47s, bytes sent: 3.1 GB, speed: 29 MB/sec
I0205 17:05:35.041990 258429 log.go:34] Rolling up Time elapsed: 01m48s, bytes sent: 3.1 GB, speed: 28 MB/sec
I0205 17:05:36.041991 258429 log.go:34] Rolling up Time elapsed: 01m49s, bytes sent: 3.1 GB, speed: 28 MB/sec
I0205 17:05:37.041989 258429 log.go:34] Rolling up Time elapsed: 01m50s, bytes sent: 3.1 GB, speed: 28 MB/sec
I0205 17:05:38.048405 258429 log.go:34] Rolling up Time elapsed: 01m51s, bytes sent: 3.1 GB, speed: 28 MB/sec
I0205 17:05:39.052383 258429 log.go:34] Rolling up Time elapsed: 01m52s, bytes sent: 3.1 GB, speed: 27 MB/sec
I0205 17:05:40.042005 258429 log.go:34] Rolling up Time elapsed: 01m53s, bytes sent: 3.1 GB, speed: 27 MB/sec
E0205 17:06:55.871480 258429 draft.go:442] Error while rolling up lists at 18061: Value with size 1439997117 exceeded 1073741823 limit. Value:
00000000 0a b7 b9 d2 ae 05 12 13 08 83 80 80 80 80 80 80 |................|
00000010 80 10 12 05 00 00 00 00 00 18 01 12 13 08 84 80 |................|
00000020 80 80 80 80 80 80 10 12 05 00 00 00 00 00 18 01 |................|
00000030 12 13 08 87 80 80 80 80 80 80 80 10 12 05 00 00 |................|
00000040 00 00 00 18 01 12 13 08 88 80 80 80 80 80 80 80 |................|
00000050 10 12 05 00 00 00 00 00 18 01 12 13 08 8b 80 80 |................|
00000060 80 80 80 80 80 10 12 05 00 00 00 00 00 18 01 12 |................|
00000070 13 08 92 80 80 80 80 80 80 80 10 12 05 00 00 00 |................|
00000080 00 00 18 01 12 13 08 95 80 80 80 80 80 80 80 10 |................|
alpha 1 log
I0205 17:04:21.971401 7770 log.go:34] Rolling up Created batch of size: 941 kB in 75.65334ms.
I0205 17:04:22.103731 7770 log.go:34] Rolling up Created batch of size: 929 kB in 76.538732ms.
I0205 17:04:22.171771 7770 log.go:34] Rolling up Created batch of size: 925 kB in 66.394744ms.
I0205 17:04:22.257937 7770 log.go:34] Rolling up Created batch of size: 927 kB in 66.859031ms.
I0205 17:04:22.334075 7770 log.go:34] Rolling up Created batch of size: 932 kB in 74.420747ms.
I0205 17:04:22.417214 7770 log.go:34] Rolling up Created batch of size: 929 kB in 68.814851ms.
I0205 17:04:23.277135 7770 log.go:34] Rolling up Time elapsed: 37s, bytes sent: 328 MB, speed: 8.9 MB/sec
I0205 17:04:23.601378 7770 log.go:34] Rolling up Time elapsed: 38s, bytes sent: 328 MB, speed: 8.6 MB/sec
I0205 17:04:24.601421 7770 log.go:34] Rolling up Time elapsed: 39s, bytes sent: 328 MB, speed: 8.4 MB/sec
I0205 17:04:25.403778 7770 log.go:34] Rolling up Created batch of size: 884 kB in 59.479535ms.
I0205 17:04:25.601506 7770 log.go:34] Rolling up Time elapsed: 40s, bytes sent: 329 MB, speed: 8.2 MB/sec
I0205 17:22:45.611681 7770 draft.go:1353] Skipping snapshot at index: 17145. Insufficient discard entries: 1. MinPendingStartTs: 23689
I0205 17:23:45.620155 7770 draft.go:1353] Skipping snapshot at index: 17145. Insufficient discard entries: 1. MinPendingStartTs: 23689
I0205 17:24:45.623038 7770 draft.go:1353] Skipping snapshot at index: 17145. Insufficient discard entries: 1. MinPendingStartTs: 23689
I0205 17:25:45.627988 7770 draft.go:1353] Skipping snapshot at index: 17145. Insufficient discard entries: 1. MinPendingStartTs: 23689
I0205 17:26:45.633200 7770 draft.go:1353] Skipping snapshot at index: 17145. Insufficient discard entries: 1. MinPendingStartTs: 23689
I0205 17:26:45.633257 7770 draft.go:1194] Found 1 old transactions. Acting to abort them.
I0205 17:26:45.633781 7770 draft.go:1155] TryAbort 1 txns with start ts. Error:
I0205 17:26:45.633795 7770 draft.go:1178] TryAbort selectively proposing only aborted txns: txns:
I0205 17:29:45.819500 7770 draft.go:403] Creating snapshot at index: 20373. ReadTs: 26306.
is there any body could help me?
Hey @JimWen, I am checking this.
I can see rollup is failed because of very large value size. Not very sure, how can we have this big size for any entry.
What kind data are you trying to insert? Does it have many entries for single subject, predicate combination?
Also while running queries/mutations what kind of response are you getting.
thank you @ashish-goswami
The data is User behavior in app like click/chat/signin/signup etc. And so the schme is like
type User{
gid_user: int
name: string
src: int
is: [uid]
has: [uid]
}
type Action{
gid_act:string
aid:string
t:dateTime
chl:string
pl:string
net:string
status:int
msg:string
extmsg:[string]
from:uid
to:uid
with:[uid]
}
and the graph is like
user1 - from- action1-to-user2
user1 - from- action2-to-user3
user1 - from- action3-to-user4
So, for a single user , it could have many from-action pairs, ie many entries for single subject, predicate combination.
i got nothing , just hung, i check the log again, the following may br useful, alpha 1 print endless log like
0206 20:57:45.601604 7770 draft.go:1194] Found 1 old transactions. Acting to abort them.
I0206 20:57:45.602277 7770 draft.go:1155] TryAbort 1 txns with start ts. Error:
I0206 20:57:45.602293 7770 draft.go:1171] TryAbort: No aborts found. Quitting.
I0206 20:57:45.602300 7770 draft.go:1197] Done abortOldTransactions for 1 txns. Error:
I0206 20:58:45.601642 7770 draft.go:1194] Found 1 old transactions. Acting to abort them.
I0206 20:58:45.602308 7770 draft.go:1155] TryAbort 1 txns with start ts. Error:
I0206 20:58:45.602325 7770 draft.go:1171] TryAbort: No aborts found. Quitting.
I0206 20:58:45.602333 7770 draft.go:1197] Done abortOldTransactions for 1 txns. Error:
I0206 20:59:45.601597 7770 draft.go:1194] Found 1 old transactions. Acting to abort them.
I0206 20:59:45.602335 7770 draft.go:1155] TryAbort 1 txns with start ts. Error:
I0206 20:59:45.602351 7770 draft.go:1171] TryAbort: No aborts found. Quitting.
I0206 20:59:45.602358 7770 draft.go:1197] Done abortOldTransactions for 1 txns. Error:
I0206 21:00:45.601632 7770 draft.go:1194] Found 1 old transactions. Acting to abort them.
I0206 21:00:45.602271 7770 draft.go:1155] TryAbort 1 txns with start ts. Error:
I0206 21:00:45.602282 7770 draft.go:1171] TryAbort: No aborts found. Quitting.
I0206 21:00:45.602288 7770 draft.go:1197] Done abortOldTransactions for 1 txns. Error:
the log is Value exceeded, so the kv here store as key(subject, predicate combination) and value(many entries) ?
@danielmai : Let's test the posting list split for Jepsen and re-introduce it, and release a patch fix.
@JimWen : You can revert back to an earlier version, that'd help get your cluster back up and running.
Dgraph v1.2.1 patch fix is out with posting list splits enabled. You can upgrade your cluster to this release. Large posting lists will be split during rollups. https://github.com/dgraph-io/dgraph/releases/v1.2.1
thank you very much, i will have a try @danielmai
I'll close this issue as the patch fix v1.2.1 is released. Please reopen if you're still seeing this issue.
The problem is still there, i'm wondering what the value that exceeded actually is and what the fix release split is.
The edges here grow very soon, about 1 billion every day and when i check the source code, i found that rolling up is set a fixed time interval of 5 minutes, is it reasonable here?
@JimWen Did you do an export/import of your v1.2.0 database into a new v1.2.1 database? When you load the data into a new v1.2.1 Dgraph instance the posting lists would be split when rollups happen.
You can configure how often snapshots (and rollups) happen via the --snapshot_after flag. By default, it's set to snapshot every 10,000 wal entries. You can set it to a lower value to make rollups more frequent.
$ dgraph alpha --help
...
--snapshot_after int Create a new Raft snapshot after this many number of Raft entries. The lower this number, the more frequent snapshot creation would be. Also determines how often Rollups would happen. (default 10000)
...
@danielmai Thank you
Considering the dataset is too big, i just rebulk load the whole data, export/import may be very slow in this situation?
--snapshot_after is something i want, i will have a try.
bad news, when i export data there is also value exceeding error, and it is the same alpha, so maybe we can say this problem is caused by bulkload when bulkload does not do any split. the log is as followings
_I0211 11:50:13.031266 228800 log.go:34] Export Created batch of size: 203 MB in 15.240465535s.
I0211 11:50:13.031313 228800 log.go:34] Export Time elapsed: 08m32s, bytes sent: 5.6 GB, speed: 11 MB/sec
I0211 11:50:13.414136 228800 log.go:34] Export Created batch of size: 194 MB in 14.331708926s.
I0211 11:50:13.414179 228800 log.go:34] Export Time elapsed: 06m26s, bytes sent: 3.9 GB, speed: 10 MB/sec
I0211 11:50:16.495130 228800 log.go:34] Export Created batch of size: 204 MB in 16.001207551s.
I0211 11:50:16.495172 228800 log.go:34] Export Time elapsed: 06m22s, bytes sent: 3.7 GB, speed: 9.8 MB/sec
I0211 11:50:20.091869 228800 log.go:34] Export Created batch of size: 204 MB in 16.797496957s.
E0211 11:50:25.814313 228800 draft.go:442] Error while rolling up lists at 17062: Value with size 2047633608 exceeded 1073741823 limit. Value:
00000000 0a c2 d1 b1 d0 07 12 13 08 ba a8 bf ae 80 80 80 |................|
00000010 80 10 12 05 00 00 00 00 00 18 01 12 13 08 bb a8 |................|
00000020 bf ae 80 80 80 80 10 12 05 00 00 00 00 00 18 01 |................|
00000030 12 13 08 bc a8 bf ae 80 80 80 80 10 12 05 00 00 |................|
00000040 00 00 00 18 01 12 13 08 bd a8 bf ae 80 80 80 80 |................|
00000050 10 12 05 00 00 00 00 00 18 01 12 13 08 be a8 bf |................|
00000060 ae 80 80 80 80 10 12 05 00 00 00 00 00 18 01 12 |................|
00000070 13 08 bf a8 bf ae 80 80 80 80 10 12 05 00 00 00 |................|
00000080 00 00 18 01 12 13 08 c0 a8 bf ae 80 80 80 80 10 |................|
00000090 12 05 00 00 00 00 00 18 01 12 13 08 c1 a8 bf ae |................|
000000a0 80 80 80 80 10 12 05 00 00 00 00 00 18 01 12 13 |................|
000000b0 08 c2 a8 bf ae 80 80 80 80 10 12 05 00 00 00 00 |................|
000000c0 00 18 01 12 13 08 c3 a8 bf ae 80 80 80 80 10 12 |................|
000000d0 05 00 00 00 00 00 18 01 12 13 08 c4 a8 bf ae 80 |................|
000000e0 80 80 80 10 12 05 00 00 00 00 00 18 01 12 13 08 |................|
000000f0 c5 a8 bf ae 80 80 80 80 10 12 05 00 00 00 00 00 |................|
00000100 18 01 12 13 08 c6 a8 bf ae 80 80 80 80 10 12 05 |................|
00000110 00 00 00 00 00 18 01 12 13 08 c7 a8 bf ae 80 80 |................|
00000120 80 80 10 12 05 00 00 00 00 00 18 01 12 13 08 c8 |................|
00000130 a8 bf ae 80 80 80 80 10 12 05 00 00 00 00 00 18 |................|
00000140 01 12 13 08 c9 a8 bf ae 80 80 80 80 10 12 05 00 |................|
00000150 00 00 00 00 18 01 12 13 08 ca a8 bf ae 80 80 80 |................|
00000160 80 10 12 05 00 00 00 00 00 18 01 12 13 08 cb a8 |................|_
I also encountered the same issue and have updated to the latest version, this issue has not been resolved
Hey @FelixHolmes ,
We'll verify this from our end and reply on this thread.
Posting lists are now split when using bulk loader (https://github.com/dgraph-io/dgraph/pull/4967) and when using backup/restore (#4912). These changes should address the issue you're seeing. I'll close this issue for now. Feel free to reopen this issue if you're still seeing it with these latest changes.
We are experiencing similar issue, one of the Alphas node were restarted (not sure if this was OOM or something else) but till then it cannot sync with the cluster, see the log entries below:
Dgraph version : v20.03.1
Dgraph SHA-256 : 6a40b1e084205ae9e29336780b3458a3869db45c0b96b916190881c16d705ba8
Commit SHA-1 : c201611d6
Commit timestamp : 2020-04-24 13:53:41 -0700
Branch : HEAD
Go version : go1.14.1
For Dgraph official documentation, visit https://docs.dgraph.io.
For discussions about Dgraph , visit https://discuss.dgraph.io.
To say hi to the community , visit https://dgraph.slack.com.
Licensed variously under the Apache Public License 2.0 and Dgraph Community License.
Copyright 2015-2020 Dgraph Labs, Inc.
I1117 08:51:27.667733 17 run.go:609] x.Config: {PortOffset:0 QueryEdgeLimit:1000000 NormalizeNodeLimit:10000}
I1117 08:51:27.667768 17 run.go:610] x.WorkerConfig: {ExportPath:export NumPendingProposals:256 Tracing:1 MyAddr:intouch-graph-engine-2.intouch-graph-engine-headless.jx-staging.svc.cluster.local:7080 ZeroAddr:[intouch-graph-engine-zero-0.intouch-graph-engine-zero-headless.jx-staging.svc.cluster.local:5080] RaftId:0 WhiteListedIPRanges:[] MaxRetries:-1 StrictMutations:false AclEnabled:false AbortOlderThan:5m0s SnapshotAfter:30000 ProposedGroupId:0 StartTime:2020-11-17 08:51:27.389721543 +0000 UTC m=+0.015988519 LudicrousMode:false BadgerKeyFile:}
I1117 08:51:27.667805 17 run.go:611] worker.Config: {PostingDir:p BadgerTables:mmap BadgerVlog:mmap BadgerKeyFile: BadgerCompressionLevel:3 WALDir:w MutationsMode:0 AuthToken: AllottedMemory:2048 HmacSecret:[] AccessJwtTtl:0s RefreshJwtTtl:0s AclRefreshInterval:0s}
I1117 08:51:27.668902 17 server_state.go:75] Setting Badger Compression Level: 3
I1117 08:51:27.668919 17 server_state.go:84] Setting Badger table load option: mmap
I1117 08:51:27.668924 17 server_state.go:96] Setting Badger value log load option: mmap
I1117 08:51:27.668932 17 server_state.go:141] Opening write-ahead log BadgerDB with options: {Dir:w ValueDir:w SyncWrites:false TableLoadingMode:1 ValueLogLoadingMode:2 NumVersionsToKeep:1 ReadOnly:false Truncate:true Logger:0x282e510 Compression:2 InMemory:false MaxTableSize:67108864 LevelSizeMultiplier:10 MaxLevels:7 ValueThreshold:1048576 NumMemtables:5 BlockSize:4096 BloomFalsePositive:0.01 KeepL0InMemory:true MaxCacheSize:10485760 MaxBfCacheSize:0 LoadBloomsOnOpen:false NumLevelZeroTables:5 NumLevelZeroTablesStall:10 LevelOneSize:268435456 ValueLogFileSize:1073741823 ValueLogMaxEntries:10000 NumCompactors:2 CompactL0OnClose:true LogRotatesToFlush:2 ZSTDCompressionLevel:3 VerifyValueChecksum:false EncryptionKey:[] EncryptionKeyRotationDuration:240h0m0s BypassLockGuard:false ChecksumVerificationMode:0 managedTxns:false maxBatchCount:0 maxBatchSize:0}
I1117 08:51:30.757072 17 log.go:34] 25 tables out of 46 opened in 3.081s
I1117 08:51:33.645631 17 log.go:34] All 46 tables opened in 5.969s
I1117 08:51:33.647329 17 log.go:34] Replaying file id: 4781 at offset: 2339515
I1117 08:51:33.647787 17 log.go:34] Replay took: 438.133µs
I1117 08:51:33.648767 17 log.go:34] Replaying file id: 4782 at offset: 0
I1117 08:51:33.665428 17 log.go:34] Replay took: 16.63622ms
I1117 08:51:33.666343 17 log.go:34] Replaying file id: 4783 at offset: 0
I1117 08:51:33.833932 17 log.go:34] Replay took: 167.566051ms
I1117 08:51:33.834879 17 log.go:34] Replaying file id: 4784 at offset: 0
I1117 08:51:33.855436 17 log.go:34] Replay took: 20.53213ms
I1117 08:51:33.856362 17 log.go:34] Replaying file id: 4785 at offset: 0
I1117 08:51:33.877636 17 log.go:34] Replay took: 21.24665ms
I1117 08:51:33.893383 17 log.go:34] Replaying file id: 4786 at offset: 0
I1117 08:51:33.921597 17 log.go:34] Replay took: 28.183555ms
I1117 08:51:33.922599 17 log.go:34] Replaying file id: 4787 at offset: 0
I1117 08:51:33.940804 17 log.go:34] Replay took: 18.182001ms
I1117 08:51:33.941539 17 server_state.go:75] Setting Badger Compression Level: 3
I1117 08:51:33.941558 17 server_state.go:84] Setting Badger table load option: mmap
I1117 08:51:33.941563 17 server_state.go:96] Setting Badger value log load option: mmap
I1117 08:51:33.941568 17 server_state.go:160] Opening postings BadgerDB with options: {Dir:p ValueDir:p SyncWrites:false TableLoadingMode:2 ValueLogLoadingMode:2 NumVersionsToKeep:2147483647 ReadOnly:false Truncate:true Logger:0x282e510 Compression:2 InMemory:false MaxTableSize:67108864 LevelSizeMultiplier:10 MaxLevels:7 ValueThreshold:1024 NumMemtables:5 BlockSize:4096 BloomFalsePositive:0.01 KeepL0InMemory:true MaxCacheSize:1073741824 MaxBfCacheSize:0 LoadBloomsOnOpen:false NumLevelZeroTables:5 NumLevelZeroTablesStall:10 LevelOneSize:268435456 ValueLogFileSize:1073741823 ValueLogMaxEntries:1000000 NumCompactors:2 CompactL0OnClose:true LogRotatesToFlush:2 ZSTDCompressionLevel:3 VerifyValueChecksum:false EncryptionKey:[] EncryptionKeyRotationDuration:240h0m0s BypassLockGuard:false ChecksumVerificationMode:0 managedTxns:false maxBatchCount:0 maxBatchSize:0}
I1117 08:51:34.260341 17 log.go:34] All 14 tables opened in 295ms
I1117 08:51:34.280837 17 log.go:34] Replaying file id: 176 at offset: 56629863
I1117 08:51:35.717268 17 log.go:34] Replay took: 1.436400243s
I1117 08:51:35.718157 17 log.go:34] Replaying file id: 177 at offset: 0
I1117 08:51:40.104149 17 log.go:34] Replay took: 4.385968142s
I1117 08:51:40.105106 17 log.go:34] Replaying file id: 178 at offset: 0
I1117 08:51:47.335873 17 log.go:34] Replay took: 7.230743178s
I1117 08:51:47.336813 17 log.go:34] Replaying file id: 179 at offset: 0
I1117 08:51:52.373200 17 log.go:34] Replay took: 5.036361883s
I1117 08:51:52.424014 17 groups.go:107] Current Raft Id: 0x3
I1117 08:51:52.424347 17 worker.go:96] Worker listening at address: [::]:7080
I1117 08:51:52.424920 17 run.go:480] Bringing up GraphQL HTTP API at 0.0.0.0:8080/graphql
I1117 08:51:52.424940 17 run.go:481] Bringing up GraphQL HTTP admin API at 0.0.0.0:8080/admin
I1117 08:51:52.424967 17 run.go:512] gRPC server started. Listening on port 9080
I1117 08:51:52.424977 17 run.go:513] HTTP server started. Listening on port 8080
I1117 08:51:52.524413 17 pool.go:160] CONNECTING to intouch-graph-engine-zero-0.intouch-graph-engine-zero-headless.jx-staging.svc.cluster.local:5080
I1117 08:51:52.533746 17 groups.go:135] Connected to group zero. Assigned group: 0
I1117 08:51:52.533771 17 groups.go:137] Raft Id after connection to Zero: 0x3
I1117 08:51:52.533823 17 pool.go:160] CONNECTING to intouch-graph-engine-1.intouch-graph-engine-headless.jx-staging.svc.cluster.local:7080
I1117 08:51:52.533847 17 pool.go:160] CONNECTING to intouch-graph-engine-0.intouch-graph-engine-headless.jx-staging.svc.cluster.local:7080
I1117 08:51:52.533878 17 pool.go:160] CONNECTING to intouch-graph-engine-zero-1.intouch-graph-engine-zero-headless.jx-staging.svc.cluster.local:5080
I1117 08:51:52.533920 17 pool.go:160] CONNECTING to intouch-graph-engine-zero-2.intouch-graph-engine-zero-headless.jx-staging.svc.cluster.local:5080
I1117 08:51:52.534021 17 draft.go:200] Node ID: 0x3 with GroupID: 1
I1117 08:51:52.534080 17 node.go:148] Setting raft.Config to: &{ID:3 peers:[] learners:[] ElectionTick:20 HeartbeatTick:1 Storage:0xc090def240 Applied:11577659 MaxSizePerMsg:262144 MaxCommittedSizePerReady:67108864 MaxUncommittedEntriesSize:0 MaxInflightMsgs:256 CheckQuorum:false PreVote:true ReadOnlyOption:0 Logger:0x282e510 DisableProposalForwarding:false}
I1117 08:51:52.602751 17 node.go:306] Found Snapshot.Metadata: {ConfState:{Nodes:[1 2 3] Learners:[] XXX_unrecognized:[]} Index:11577659 Term:16 XXX_unrecognized:[]}
I1117 08:51:52.603047 17 node.go:317] Found hardstate: {Term:32 Vote:0 Commit:12815892 XXX_unrecognized:[]}
I1117 08:51:55.365446 17 node.go:326] Group 1 found 1238233 entries
I1117 08:51:55.365506 17 draft.go:1544] Restarting node for group: 1
I1117 08:51:55.365592 17 node.go:185] Setting conf state to nodes:1 nodes:2 nodes:3
I1117 08:51:55.366057 17 log.go:34] 3 became follower at term 32
I1117 08:51:55.366137 17 log.go:34] newRaft 3 [peers: [1,2,3], term: 32, commit: 12815892, applied: 11577659, lastindex: 12815892, lastterm: 16]
I1117 08:51:55.366169 17 draft.go:147] Operation started with id: opRollup
I1117 08:51:55.366187 17 groups.go:155] Server is ready
I1117 08:51:55.366267 17 draft.go:962] Found Raft progress: 11597221
I1117 08:51:55.366293 17 groups.go:784] Got address of a Zero leader: intouch-graph-engine-zero-0.intouch-graph-engine-zero-headless.jx-staging.svc.cluster.local:5080
I1117 08:51:55.366464 17 groups.go:797] Starting a new membership stream receive from intouch-graph-engine-zero-0.intouch-graph-engine-zero-headless.jx-staging.svc.cluster.local:5080.
I1117 08:51:55.369259 17 groups.go:814] Received first state update from Zero: counter:443648 groups:<key:1 value:<members:<key:1 value:<id:1 group_id:1 addr:"intouch-graph-engine-0.intouch-graph-engine-headless.jx-staging.svc.cluster.local:7080" leader:true last_update:1605564628 > > members:<key:2 value:<id:2 group_id:1 addr:"intouch-graph-engine-1.intouch-graph-engine-headless.jx-staging.svc.cluster.local:7080" last_update:1605564625 > > members:<key:3 value:<id:3 group_id:1 addr:"intouch-graph-engine-2.intouch-graph-engine-headless.jx-staging.svc.cluster.local:7080" last_update:1604297281 > > tablets:<key:"assetState" value:<group_id:1 predicate:"assetState" space:1341899 > > tablets:<key:"base64EncodedDefinition" value:<group_id:1 predicate:"base64EncodedDefinition" space:1739112037 > > tablets:<key:"checkedOutBy" value:<group_id:1 predicate:"checkedOutBy" space:81869 > > tablets:<key:"childOf" value:<group_id:1 predicate:"childOf" space:543055 > > tablets:<key:"color" value:<group_id:1 predicate:"color" space:101454 > > tablets:<key:"contentType" value:<group_id:1 predicate:"contentType" space:89587 > > tablets:<key:"created" value:<group_id:1 predicate:"created" space:1197131 > > tablets:<key:"createdBy" value:<group_id:1 predicate:"createdBy" space:1124847 > > tablets:<key:"deliveryMethods" value:<group_id:1 predicate:"deliveryMethods" space:777663 > > tablets:<key:"description" value:<group_id:1 predicate:"description" space:167133 > > tablets:<key:"dgraph.acl.rule" value:<group_id:1 predicate:"dgraph.acl.rule" > > tablets:<key:"dgraph.graphql.schema" value:<group_id:1 predicate:"dgraph.graphql.schema" > > tablets:<key:"dgraph.group.acl" value:<group_id:1 predicate:"dgraph.group.acl" > > tablets:<key:"dgraph.password" value:<group_id:1 predicate:"dgraph.password" > > tablets:<key:"dgraph.rule.permission" value:<group_id:1 predicate:"dgraph.rule.permission" > > tablets:<key:"dgraph.rule.predicate" value:<group_id:1 predicate:"dgraph.rule.predicate" > > tablets:<key:"dgraph.type" value:<group_id:1 predicate:"dgraph.type" space:1713567 > > tablets:<key:"dgraph.user.group" value:<group_id:1 predicate:"dgraph.user.group" > > tablets:<key:"dgraph.xid" value:<group_id:1 predicate:"dgraph.xid" > > tablets:<key:"locks" value:<group_id:1 predicate:"locks" space:35294 > > tablets:<key:"name" value:<group_id:1 predicate:"name" space:11290952223 > > tablets:<key:"nextVersion" value:<group_id:1 predicate:"nextVersion" space:152863 > > tablets:<key:"nodeType" value:<group_id:1 predicate:"nodeType" space:1111689 > > tablets:<key:"ownerId" value:<group_id:1 predicate:"ownerId" space:1076954 > > tablets:<key:"parentOf" value:<group_id:1 predicate:"parentOf" space:57135 > > tablets:<key:"previousVersion" value:<group_id:1 predicate:"previousVersion" space:163574 > > tablets:<key:"processId" value:<group_id:1 predicate:"processId" space:29153 > > tablets:<key:"qaReviewer" value:<group_id:1 predicate:"qaReviewer" space:665 > > tablets:<key:"sourceOf" value:<group_id:1 predicate:"sourceOf" space:1049 > > tablets:<key:"staticId" value:<group_id:1 predicate:"staticId" space:2058190830 > > tablets:<key:"subtype" value:<group_id:1 predicate:"subtype" space:242150 > > tablets:<key:"tenant" value:<group_id:1 predicate:"tenant" space:38437 > > tablets:<key:"textRepresentation" value:<group_id:1 predicate:"textRepresentation" space:2367164 > > tablets:<key:"uid" value:<group_id:1 predicate:"uid" > > tablets:<key:"updated" value:<group_id:1 predicate:"updated" space:1197002 > > tablets:<key:"usedBy" value:<group_id:1 predicate:"usedBy" space:125085 > > tablets:<key:"uses" value:<group_id:1 predicate:"uses" space:249567 > > tablets:<key:"validFrom" value:<group_id:1 predicate:"validFrom" space:2704 > > tablets:<key:"validTo" value:<group_id:1 predicate:"validTo" space:2544 > > tablets:<key:"versionId" value:<group_id:1 predicate:"versionId" space:2334840 > > tablets:<key:"workflowState" value:<group_id:1 predicate:"workflowState" space:711890 > > snapshot_ts:37258482 checksum:12122767884684680226 > > zeros:<key:1 value:<id:1 addr:"intouch-graph-engine-zero-0.intouch-graph-engine-zero-headless.jx-staging.svc.cluster.local:5080" leader:true > > zeros:<key:2 value:<id:2 addr:"intouch-graph-engine-zero-1.intouch-graph-engine-zero-headless.jx-staging.svc.cluster.local:5080" > > zeros:<key:3 value:<id:3 addr:"intouch-graph-engine-zero-2.intouch-graph-engine-zero-headless.jx-staging.svc.cluster.local:5080" > > maxLeaseId:2054781 maxTxnTs:37450000 maxRaftId:3 cid:"afe29dcb-acf1-44b2-82e6-7c1f3511bdd4" license:<maxNodes:18446744073709551615 expiryTs:1597214731 >
W1117 08:51:59.019720 17 draft.go:1183] Raft.Ready took too long to process: Timer Total: 388ms. Breakdown: [{proposals 388ms} {disk 0s} {advance 0s}] Num entries: 0. MustSync: false
I1117 08:52:55.457314 17 draft.go:1466] Skipping snapshot at index: 11597221. Insufficient discard entries: 1. MinPendingStartTs: 33278617
I1117 08:53:55.432306 17 draft.go:1466] Skipping snapshot at index: 11597221. Insufficient discard entries: 1. MinPendingStartTs: 33278617
I1117 08:54:55.441072 17 draft.go:1466] Skipping snapshot at index: 11597221. Insufficient discard entries: 1. MinPendingStartTs: 33278617
I1117 08:55:55.462392 17 draft.go:1466] Skipping snapshot at index: 11597221. Insufficient discard entries: 1. MinPendingStartTs: 33278617
I1117 08:56:55.477847 17 draft.go:1466] Skipping snapshot at index: 11597221. Insufficient discard entries: 1. MinPendingStartTs: 33278617
I1117 08:57:55.500429 17 draft.go:1466] Skipping snapshot at index: 11597221. Insufficient discard entries: 1. MinPendingStartTs: 33278617
I1117 08:58:09.976762 17 log.go:34] raft.node: 3 elected leader 1 at term 32
I1117 08:58:55.564585 17 draft.go:1466] Skipping snapshot at index: 11597221. Insufficient discard entries: 1. MinPendingStartTs: 33278617
I1117 08:59:55.520407 17 draft.go:1466] Skipping snapshot at index: 11597221. Insufficient discard entries: 1. MinPendingStartTs: 33278617
I1117 09:00:54.489529 17 log.go:34] Got compaction priority: {level:0 score:1 dropPrefix:[]}
I1117 09:00:54.489590 17 log.go:34] Running for level: 0
I1117 09:00:55.817463 17 draft.go:1466] Skipping snapshot at index: 11597221. Insufficient discard entries: 1. MinPendingStartTs: 33278617
I1117 09:01:11.148732 17 log.go:34] LOG Compact 0->1, del 19 tables, add 14 tables, took 16.659112274s
I1117 09:01:11.148797 17 log.go:34] Compaction for level: 0 DONE
I1117 09:01:55.795079 17 draft.go:1466] Skipping snapshot at index: 11597221. Insufficient discard entries: 1. MinPendingStartTs: 33278617
I1117 09:02:55.823026 17 draft.go:1466] Skipping snapshot at index: 11597221. Insufficient discard entries: 1. MinPendingStartTs: 33278617
I1117 09:03:16.090302 17 draft.go:822] Blocked pushing to applyCh for 11m16.611s
W1117 09:03:16.122066 17 draft.go:1183] Raft.Ready took too long to process: Timer Total: 11m16.873s. Breakdown: [{proposals 11m16.873s} {disk 0s} {advance 0s}] Num entries: 0. MustSync: false
I1117 09:03:55.828642 17 draft.go:1466] Skipping snapshot at index: 11597221. Insufficient discard entries: 1. MinPendingStartTs: 33278617
I1117 09:04:55.783930 17 draft.go:1466] Skipping snapshot at index: 11597221. Insufficient discard entries: 1. MinPendingStartTs: 33278617
I1117 09:05:55.898158 17 draft.go:1466] Skipping snapshot at index: 11597221. Insufficient discard entries: 1. MinPendingStartTs: 33278617
I1117 09:06:19.182181 17 run.go:652] Caught Ctrl-C. Terminating now (this may take a few seconds)...
E1117 09:06:19.182380 17 run.go:410] Stopped taking more http(s) requests. Err: accept tcp [::]:8080: use of closed network connection
E1117 09:06:19.182296 17 run.go:391] GRPC listener canceled: accept tcp [::]:9080: use of closed network connection
I1117 09:06:19.183192 17 run.go:675] GRPC and HTTP stopped.
I1117 09:06:19.183200 17 worker.go:112] Stopping group...
I1117 09:06:19.183354 17 worker.go:116] Updating RAFT state before shutting down...
E1117 09:06:19.183651 17 groups.go:805] Unable to sync memberships. Error: rpc error: code = Canceled desc = context canceled. State: <nil>
I1117 09:06:19.677330 17 draft.go:1466] Skipping snapshot at index: 11597221. Insufficient discard entries: 1. MinPendingStartTs: 33278617
I1117 09:06:19.677365 17 worker.go:121] Stopping node...
I1117 09:06:19.677417 17 draft.go:177] Stopped all ongoing registered tasks.
I1117 09:06:19.677431 17 draft.go:112] Operation completed with id: opRollup
I1117 09:06:19.677435 17 draft.go:904] Stopping node.Run
W1117 09:06:19.776534 17 raft_server.go:239] Error while raft.Step from 0x1: raft: stopped. Closing RaftMessage stream.
...
It runs under k8s orchestration so we assume that k8s restarts the node as it becomes unresponsive. Any idea why it can sync with rest of the Dgraph cluster nodes? And what else we can do except exporting data and setting a new cluster from scratch?
I run the following commands, if you would like to receive them let please let me know
curl http://localhost:8080/debug/pprof/heap > heap.$(date +%s).pprof
curl http://localhost:8080/debug/pprof/profile > cpu.$(date +%s).pprof
curl http://localhost:8080/debug/pprof/blo > blo.$(date +%s).pprof
curl http://localhost:8080/debug/pprof/goroutine?debug=2 > routine.txt
When I performed the following a read-only query against that broken node
{
q(func:eq(nodeType,"realm")){
uid
}
}
i got such response
{
"name": "t",
"url": "http://localhost:8080/query?timeout=20s&ro=true&be=true",
"errors": [
{
"message": ": cannot retrieve UIDs from list with key 0000086e6f646554797065020bd9993d51267025c348b0c14197906ae6633d88d4edcb593ae9801e90e9ef1af4: readTs: 33504374 less than minTs: 37136412 for key: \"\\x00\\x00\\bnodeType\\x02\\vÙ™=Q&p%\\xc3H\\xb0\\xc1A\\x97\\x90j\\xe6c=\\x88\\xd4\\xed\\xcbY:\\xe9\\x80\\x1e\\x90\\xe9\\xef\\x1a\\xf4\"",
"extensions": {
"code": "ErrorInvalidRequest"
}
}
]
}
does it mean that the data is corrupted on the disk?
Most helpful comment
is there any body could help me?