Dgraph: Data missing in Dgraph cluster after bulk loading

Created on 15 Feb 2018 · 19 Comments · Source: dgraph-io/dgraph

Dgraph version - 1.0.3

OS - CentOS 7

Steps to reproduce the issue

Started zero server
nohup dgraph zero --my=10.111.111.101:5080 --replicas=3 --idx=01 &

Ran bulk loader
dgraph bulk -r /home/mapr/DgraphResources/sample.rdf -s /home/mapr/DgraphResources/sample.schema --map_shards=6 --reduce_shards=3 -z 10.111.111.101:5080

sample.schema.txt
sample.rdf.txt

Copied the p folders to all 3 nodes (/opt/dgraph/data)
Started dgraph server in 3 nodes (inside /opt/dgraph/data)
nohup dgraph server --memory_mb=16000 --my=10.111.111.101:7080 --zero=10.111.111.101:5080 &
nohup dgraph server --memory_mb=16000 --my=10.111.111.104:7080 --zero=10.111.111.101:5080 &
nohup dgraph server --memory_mb=16000 --my=10.111.111.107:7080 --zero=10.111.111.101:5080 &
- Expected behaviour
  Querying any of the 3 nodes (using the IP:port/query API) returns the result.
- Actual behaviour
  Only one node returns a partial result; the other 2 nodes return an empty result.

End point - http://10.111.111.107:8080/query?debug=true
Post Body - query.txt
Response - response.txt

Result of cluster state api - /state
cluster state.txt

Note
I initially tried with ~50M edges and faced the same issue.

Please help me understand what is missing.

kind/bug

veludurai106

All 19 comments

I would suggest running the bulk loader with --reduce_shards=1, since you want the 3 Dgraph servers to be replicas of each other. Then give out/0/p as input to one of the Dgraph servers when starting it.

nohup dgraph server --memory_mb=16000 --my=10.111.111.101:7080 --zero=10.111.111.101:5080 -p out/0/p &

The other replica Dgraph servers would get the data via a snapshot on startup. I will investigate this issue anyway.

pawanrawal on 19 Feb 2018

I am going to close the issue. A node returning a partial result should not occur, and I couldn't reproduce it. I have opened another issue about replicas not fetching the snapshot if the initial data in the p directory is the same for all of them. https://github.com/dgraph-io/dgraph/issues/2202

Feel free to reopen this issue if you still encounter it.

pawanrawal on 7 Mar 2018

Hi @pawanrawal, I was able to reproduce the same issue. I wanted to start Dgraph with one zero and 3 servers. I do not want 3 replicas, however; I'm only interested in data distribution. Hence I started zero with 1 replica and 3 servers, each pointing to the corresponding reduce shard's p directory from the bulk output.
When I query, only one of the servers returns a partial set of predicates; the rest do not show any predicates at all.

sriharshaboda on 30 May 2018

So, basically, 3 reduce shards. 3 Dgraph servers. Each pointing to a corresponding shard. And you're not able to get full results as expected?

Can you share the commands you used for bulk loading, and then for running the servers?

manishrjain on 30 May 2018

dgraph zero
dgraph bulk -r ~/dgraph/21million.rdf.gz -s ~/dgraph/21million.schema --map_shards=6 --reduce_shards=3 --http localhost:8765 --zero=localhost:5080

ls out 
0  1  2

cd out/0
dgraph server --lru_mb 2048 --zero localhost:5080

cd out/1
dgraph server --lru_mb 2048 --zero localhost:5080 -o 200

cd out/2
dgraph server --lru_mb 2048 --zero localhost:5080 -o 400

Attaching the state, where you can see that only a few predicates occupy a decent space.
state.txt

sriharshaboda on 31 May 2018

Hey @sriharshaboda, you are not using --my=server in your commands, so Zero may only be able to talk to one server.
Try these commands and let me know if you run into any issues.

cd out/0
dgraph server --my=server_1:7080 --lru_mb=2048 --zero=zero:5080

cd out/1
dgraph server --my=server_2:7280 --lru_mb=2048 --zero=zero:5080 -o 200

cd out/2
dgraph server --my=server_3:7480 --lru_mb=2048 --zero=zero:5080 -o 400

ghost on 1 Jun 2018

Hi @karan28aug, I've tried with the --my option set, but that didn't change anything. I still see missing predicates. Please let me know if I should be trying something else instead.

sriharshaboda on 1 Jun 2018

I can confirm that this is an issue. Needs more investigation.

In the meantime, I'd say use bulk loader with just one reduce shard. Dgraph Zero would automatically move predicates around to empty servers anyway. So, that can help even out the data among the 3 Dgraph servers.

manishrjain on 2 Jun 2018

@manishrjain I would like to take a stab at this issue. Are there any pointers that you can give me? Packages / files / tests that might be related to this issue, that I can look at?

ashwanthkumar on 5 Jun 2018

I think the biggest thing to look at here is to ensure that the bulk loader is correctly outputting two Badger directories with different predicates.

What I'd do is to run bulk loader with 2 reduce shards, and then write a Badger script to iterate over the keys, and output which predicates they belong to (using x.ParseKey). That'd be a good way to determine what gets generated, and how the data gets split.
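A minimal sketch of the comparison step, assuming the predicate names have already been extracted from each shard's keys. The Badger iteration and the x.ParseKey call are not shown here, and the shard names and predicate lists below are hypothetical. The helper `overlappingPredicates` is likewise illustrative and not part of Dgraph. It reports any predicate that shows up in more than one reduce shard, which is the situation a clean split should rule out.

```go
package main

import (
	"fmt"
	"sort"
)

// overlappingPredicates returns every predicate that occurs in more than one
// shard, mapped to the sorted list of shards that contain it.
func overlappingPredicates(shards map[string][]string) map[string][]string {
	seen := make(map[string]map[string]bool) // predicate -> set of shard names
	for shard, preds := range shards {
		for _, p := range preds {
			if seen[p] == nil {
				seen[p] = make(map[string]bool)
			}
			seen[p][shard] = true
		}
	}

	overlaps := make(map[string][]string)
	for p, set := range seen {
		if len(set) < 2 {
			continue // predicate lives in exactly one shard, as intended
		}
		names := make([]string, 0, len(set))
		for s := range set {
			names = append(names, s)
		}
		sort.Strings(names)
		overlaps[p] = names
	}
	return overlaps
}

func main() {
	// Hypothetical predicate lists, one per reduce shard directory.
	shards := map[string][]string{
		"out/0": {"name", "genre", "director.film"},
		"out/1": {"name", "initial_release_date"},
	}

	overlaps := overlappingPredicates(shards)
	preds := make([]string, 0, len(overlaps))
	for p := range overlaps {
		preds = append(preds, p)
	}
	sort.Strings(preds)
	for _, p := range preds {
		fmt.Printf("%s appears in %v\n", p, overlaps[p])
	}
}
```

With the sample input, only `name` is reported, because it is the only predicate present in both shards.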

manishrjain on 5 Jun 2018

Thanks for the pointer @manishrjain -- I would look into it and get back with my observations.

ashwanthkumar on 5 Jun 2018

Figured out what the issue was. Bulk loader was writing out the schema for all predicates in all the reduce shards. Therefore, both Dgraph servers were telling Zero that they "owned" that predicate, and one of them was losing out.

The fix should be simple: only write the schema to the DB which has that predicate's data. So, there's a clear split between the predicates. I'll submit a fix next week, and try and make a new release.
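A toy model of the failure described above, written as a sketch rather than as Dgraph code. It assumes a first-claim-wins rule (the first group to claim a predicate keeps it), which is a simplification made for illustration and not taken from Zero's implementation. The function names and the two sample predicates are hypothetical.

```go
package main

import "fmt"

// assignOwners records, for each predicate, the first group that claims it.
// Later claims for the same predicate are ignored (assumed rule, see above).
func assignOwners(claims [][]string) map[string]int {
	owners := make(map[string]int)
	for group, preds := range claims {
		for _, p := range preds {
			if _, taken := owners[p]; !taken {
				owners[p] = group
			}
		}
	}
	return owners
}

// servedData lists, per group, the predicates that the group both owns and
// actually holds data for. Anything else is invisible to queries.
func servedData(owners map[string]int, data [][]string) [][]string {
	served := make([][]string, len(data))
	for group, preds := range data {
		for _, p := range preds {
			if g, ok := owners[p]; ok && g == group {
				served[group] = append(served[group], p)
			}
		}
	}
	return served
}

func main() {
	// Data actually stored in each reduce shard (hypothetical predicates).
	data := [][]string{{"name"}, {"genre"}}

	// Before the fix: every shard carries the schema for every predicate.
	buggySchema := [][]string{{"name", "genre"}, {"name", "genre"}}
	// After the fix: a shard carries schema only for the data it holds.
	fixedSchema := data

	fmt.Println("buggy:", servedData(assignOwners(buggySchema), data))
	fmt.Println("fixed:", servedData(assignOwners(fixedSchema), data))
	// Output:
	// buggy: [[name] []]
	// fixed: [[name] [genre]]
}
```

In the buggy case group 0 claims `genre` without holding its data, so `genre` is served by nobody and group 1 serves nothing. That matches the symptom reported earlier in the thread, where one server returns partial results and the others return none.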

manishrjain on 9 Jun 2018

@ashwanthkumar @sriharshaboda Can you try again from master?

manishrjain on 20 Jun 2018

Hi @manishrjain I'm using
dgraph version : v1.0.11
Commit SHA-1 : b2a09c5b
Commit timestamp : 2018-12-17 09:50:56 -0800
Branch : HEAD
Go version : go1.11.1

And I see the same issue. Can you please confirm if the bug fix is included in the release I'm using?

EshwarSR on 22 Feb 2019

How to solve this problem?

FelixHolmes on 2 Mar 2019

Try from master. I've tested it and it is working. The fix isn't in any release yet. Use master only for the bulk load; after that, use your usual version.

Dgraph version   : v1.0.12-rc3-152-g81ff46e0
Commit SHA-1     : 81ff46e0
Commit timestamp : 2019-02-26 09:57:40 -0800
Branch           : master
Go version       : go1.11.5

PS. Or try v1.0.12-rc7; it has the latest fix for this.

MichelDiz on 2 Mar 2019

What's the latest solution? Can you write a document or blog about the specific steps of bulk loading? Because many people have the same problem, I am actively trying to use it in a production environment, but it has been delayed for a long time.

FelixHolmes on 6 Mar 2019

V1.0.12 is out. That has the fix.

manishrjain on 6 Mar 2019

Closing old ticket.