Dgraph: Data missing in Dgraph cluster after bulk loading

Created on 15 Feb 2018 · 19 Comments · Source: dgraph-io/dgraph

Dgraph version - 1.0.3

OS - CentOS 7

Steps to reproduce the issue

  • Started zero server
    nohup dgraph zero --my=10.111.111.101:5080 --replicas=3 --idx=01 &
  • Ran bulk loader
    dgraph bulk -r /home/mapr/DgraphResources/sample.rdf -s /home/mapr/DgraphResources/sample.schema --map_shards=6 --reduce_shards=3 -z 10.111.111.101:5080

sample.schema.txt
sample.rdf.txt

  • Copied the p folders to all 3 nodes (/opt/dgraph/data)

  • Started dgraph server in 3 nodes (inside /opt/dgraph/data)
    nohup dgraph server --memory_mb=16000 --my=10.111.111.101:7080 --zero=10.111.111.101:5080 &
    nohup dgraph server --memory_mb=16000 --my=10.111.111.104:7080 --zero=10.111.111.101:5080 &
    nohup dgraph server --memory_mb=16000 --my=10.111.111.107:7080 --zero=10.111.111.101:5080 &

    • Expected behaviour
      I should get the result when I query any of the 3 nodes (using the IP:port/query API).

    • Actual behaviour
      Only one node returns a partial result, whereas the other 2 nodes return an empty result.

End point - http://10.111.111.107:8080/query?debug=true
Post Body - query.txt
Response - response.txt

Result of cluster state api - /state
cluster state.txt

Note
I initially tried with ~50M edges and faced the same issue.

Please help me understand what is missing.

kind/bug

All 19 comments

I would suggest running the bulk loader with --reduce_shards=1, since you want the 3 Dgraph servers to be replicas of each other. Then give out/0/p as input to one of the Dgraph servers when starting it.

nohup dgraph server --memory_mb=16000 --my=10.111.111.101:7080 --zero=10.111.111.101:5080 -p out/0/p &

The other replica Dgraph servers would get the data via a snapshot when they start. I will investigate this issue anyway.
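
For reference, the workaround above can be laid out as a dry-run sketch. It assumes the same paths and addresses as the original report and only prints the commands so they can be reviewed first. The `run` helper is made up for illustration, and the replicas catching up by snapshot is what the comment above expects, not something this sketch verifies.

```shell
#!/bin/sh
# Dry-run sketch: prints the commands instead of running them.
# Paths and addresses are copied from the original report; adjust as needed.
ZERO=10.111.111.101:5080
RES=/home/mapr/DgraphResources

# Hypothetical helper: echoes its arguments. Swap the body for "$@" to execute.
run() { echo "$@"; }

# 1. Bulk load into a single reduce shard, so the only output is out/0/p.
run dgraph bulk -r "$RES/sample.rdf" -s "$RES/sample.schema" \
    --map_shards=6 --reduce_shards=1 -z "$ZERO"

# 2. Start the first server on the bulk loader output.
run dgraph server --memory_mb=16000 --my=10.111.111.101:7080 --zero="$ZERO" -p out/0/p

# 3. Start the remaining replicas without a copied p directory.
for ip in 10.111.111.104 10.111.111.107; do
  run dgraph server --memory_mb=16000 --my="$ip:7080" --zero="$ZERO"
done
```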

I am going to close the issue. A node returning a partial result should not happen, and I couldn't reproduce it. I have opened another issue about replicas not fetching the snapshot when the initial data in their p directories is the same: https://github.com/dgraph-io/dgraph/issues/2202

Feel free to reopen this issue if you still encounter it.

Hi @pawanrawal, I was able to reproduce the same issue. I wanted to start Dgraph with one zero and 3 servers. I do not want 3 replicas, however; I'm only interested in data distribution. Hence I started zero with 1 replica, and 3 servers with each pointing to the corresponding reduce shard of the bulk output p directory.
When I query, only one of the servers returns some of the predicates; the rest do not show any predicates at all.

So, basically, 3 reduce shards. 3 Dgraph servers. Each pointing to a corresponding shard. And you're not able to get full results as expected?

Can you share the commands you used for bulk loading, and then for running the servers?

dgraph zero
dgraph bulk -r ~/dgraph/21million.rdf.gz -s ~/dgraph/21million.schema --map_shards=6 --reduce_shards=3 --http localhost:8765 --zero=localhost:5080

ls out 
0  1  2

cd out/0
dgraph server --lru_mb 2048 --zero localhost:5080

cd out/1
dgraph server --lru_mb 2048 --zero localhost:5080 -o 200

cd out/2
dgraph server --lru_mb 2048 --zero localhost:5080 -o 400

Attaching the state, where you can see that only a few predicates occupy a decent space.
state.txt

Hey @sriharshaboda, you are not using --my=server in your commands, so it may be able to talk to just one server.
Try these commands and let me know if you run into any issues.

cd out/0
dgraph server --my=server_1:7080 --lru_mb=2048 --zero=zero:5080

cd out/1
dgraph server --my=server_2:7280 --lru_mb=2048 --zero=zero:5080 -o 200

cd out/2
dgraph server --my=server_3:7480 --lru_mb=2048 --zero=zero:5080 -o 400
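
The ports passed to --my in these commands follow the -o offsets: 7080 plus the offset. A small sketch of that arithmetic, assuming the internal port 7080 and the HTTP port 8080 seen earlier in the thread, and assuming the HTTP port shifts by the same offset. The helper names are made up for illustration.

```shell
#!/bin/sh
# Base ports as they appear in the thread: 7080 (--my) and 8080 (/query).
# my_port and http_port are hypothetical helpers, not Dgraph commands.
my_port()   { echo $((7080 + $1)); }
http_port() { echo $((8080 + $1)); }

for offset in 0 200 400; do
  echo "-o $offset: --my port $(my_port "$offset"), HTTP port $(http_port "$offset")"
done
```

Under these assumptions, the server started with -o 200 would be queried on port 8280 rather than 8080.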

Hi @karan28aug, I've tried with the --my option enabled, but that didn't change anything. I still see missing predicates. Please let me know if I should be trying something else instead.

I can confirm that this is an issue. Needs more investigation.

In the meantime, I'd say use bulk loader with just one reduce shard. Dgraph Zero would automatically move predicates around to empty servers anyway. So, that can help even out the data among the 3 Dgraph servers.

@manishrjain I would like to take a stab at this issue. Are there any pointers you can give me? Packages / files / tests related to this issue that I can look at?

I think the biggest thing to look at here is to ensure that the bulk loader is correctly outputting two Badger directories with different predicates.

What I'd do is to run bulk loader with 2 reduce shards, and then write a Badger script to iterate over the keys, and output which predicates they belong to (using x.ParseKey). That'd be a good way to determine what gets generated, and how the data gets split.

Thanks for the pointer @manishrjain -- I would look into it and get back with my observations.

Figured out what the issue was. The bulk loader was writing out the schema for all predicates in all the reduce shards. Therefore, both Dgraph servers were telling Zero that they "owned" that predicate, and one of them was losing out.

The fix should be simple: only write the schema to the DB that has that predicate's data, so there's a clear split between the predicates. I'll submit a fix next week, and try to make a new release.
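
To make the failure mode concrete, here is a toy model in shell. It is not Dgraph code: the predicate names, the shard layout, the registration order, and the "first claim wins" rule are all assumptions chosen to illustrate the explanation above.

```shell
#!/bin/sh
# Toy model only. Predicates, shard layout and claim order are made up.

# Which reduce shard holds the data for each predicate (hypothetical layout).
data_shard() {
  case "$1" in
    name)   echo 0 ;;
    age)    echo 1 ;;
    friend) echo 2 ;;
  esac
}

# owner_of PREDICATE MODE
# buggy: every shard carries the schema for every predicate, so the first
#        server to register (shard 0) claims all of them.
# fixed: only the shard holding a predicate's data carries its schema.
owner_of() {
  pred=$1; mode=$2
  for shard in 0 1 2; do   # servers register in this order
    if [ "$mode" = buggy ] || [ "$(data_shard "$pred")" = "$shard" ]; then
      echo "$shard"        # first claim wins in this model
      return
    fi
  done
}

# Print, for each predicate, which server ends up owning it.
report() {
  mode=$1
  for pred in name age friend; do
    owner=$(owner_of "$pred" "$mode")
    if [ "$owner" = "$(data_shard "$pred")" ]; then
      echo "$mode: $pred -> server $owner (has the data)"
    else
      echo "$mode: $pred -> server $owner (no data, query comes back empty)"
    fi
  done
}

report buggy
report fixed
```

In the buggy mode, server 0 owns every predicate but only holds data for one of them, which lines up with the reports of one node returning partial results. In the fixed mode, each predicate is owned by the server that actually holds its data.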

@ashwanthkumar @sriharshaboda Can you try again from master?

Hi @manishrjain I'm using
dgraph version : v1.0.11
Commit SHA-1 : b2a09c5b
Commit timestamp : 2018-12-17 09:50:56 -0800
Branch : HEAD
Go version : go1.11.1

And I see the same issue. Can you please confirm if the bug fix is included in the release I'm using?

How to solve this problem?

Try from master. I've tested it and it is working. The fix isn't in any release yet. Use master only for the bulk load; after that, use your usual version.

Dgraph version   : v1.0.12-rc3-152-g81ff46e0
Commit SHA-1     : 81ff46e0
Commit timestamp : 2019-02-26 09:57:40 -0800
Branch           : master
Go version       : go1.11.5

PS: Or try v1.0.12-rc7; it has the latest fix for this.

What's the latest solution? Can you write a document or blog post about the specific steps of bulk loading? Many people have the same problem. I am actively trying to use it in a production environment, but it has been delayed for a long time.

v1.0.12 is out. It has the fix.

Closing old ticket.
