Dgraph: Export count mismatch after predicate move

Created on 24 Jan 2018 · 16 Comments · Source: dgraph-io/dgraph

Loaded 21 million RDFs using the bulk loader with reduce_shards set to 3, then started up three Dgraph instances. The export count is correct initially but starts going down after a predicate move has happened.

kind/bug

pawanrawal

Most helpful comment

This was caused by three different small issues which have been fixed in 7131e04a3090f389868a7f938fcb6c4954e21f31, 6da3c7d73960ad99d227c81e60d027590344c962 and 85af112124dfe2c582f13065cc18dab16bbaf094.

pawanrawal on 29 Jan 2018

❤2

All 16 comments

I think the issue is not resolved.

Dgraph version   : v1.0.2-dev
Commit SHA-1     : 2fade242
Commit timestamp : 2018-01-31 23:07:30 +1100
Branch           : HEAD

Cluster configuration: Docker swarm, 3 server nodes, 3 zero nodes, replication x3. docker-compose.yml:

version: "3.4"

services:
  zero-1:
    image: dgraph/dgraph:master
    hostname: "zero-1"
    command: dgraph zero --my=zero-1:5080 --replicas 3 --idx 1
    volumes:
      - data:/dgraph
    networks:
      - dgraph
    ports:
      - 6080:6080
    deploy:
      placement:
        constraints:
          - node.hostname == swarm-manager-1

  zero-2:
    image: dgraph/dgraph:master
    hostname: "zero-2"
    command: dgraph zero --my=zero-2:5080 --replicas 3 --idx 2 --peer zero-1:5080
    volumes:
      - data:/dgraph
    networks:
      - dgraph
    deploy:
      placement:
        constraints:
          - node.hostname == swarm-manager-2

  zero-3:
    image: dgraph/dgraph:master
    hostname: "zero-3"
    command: dgraph zero --my=zero-3:5080 --replicas 3 --idx 3 --peer zero-1:5080
    volumes:
      - data:/dgraph
    networks:
      - dgraph
    deploy:
      placement:
        constraints:
          - node.hostname == swarm-manager-3

  server-1:
    image: dgraph/dgraph:master
    hostname: "server-1"
    command: dgraph server --my=server-1:7080 --memory_mb=1568 --zero=zero-1:5080 --export=/dgraph/export
    volumes:
      - data:/dgraph
    networks:
      - dgraph
    ports:
      - 8080:8080
    deploy:
      replicas: 1
      placement:
        constraints:
          - node.hostname == swarm-manager-1

  server-2:
    image: dgraph/dgraph:master
    hostname: "server-2"
    command: dgraph server --my=server-2:7080 --memory_mb=1568 --zero=zero-1:5080 --export=/dgraph/export
    volumes:
      - data:/dgraph
    networks:
      - dgraph
    ports:
      - 8081:8080
    deploy:
      replicas: 1
      placement:
        constraints:
          - node.hostname == swarm-manager-2

  server-3:
    image: dgraph/dgraph:master
    hostname: "server-3"
    command: dgraph server --my=server-3:7080 --memory_mb=1568 --zero=zero-1:5080 --export=/dgraph/export
    volumes:
      - data:/dgraph
    ports:
      - 8082:8080
    networks:
      - dgraph
    deploy:
      replicas: 1
      placement:
        constraints:
          - node.hostname == swarm-manager-3

  ratel:
    image: dgraph/dgraph:master
    command: dgraph-ratel
    networks:
      - dgraph
    ports:
      - 18049:8081

networks:
  dgraph:
    external: true

volumes:
  data:

Steps to reproduce:

[root@swarm-manager-1 ~]# docker stack deploy dgraph --compose-file=/root/dgraph/docker-compose-master.yml
[root@swarm-manager-1 ~]# wget "https://github.com/dgraph-io/tutorial/blob/master/resources/1million.rdf.gz?raw=true" -O 1million.rdf.gz -q ; gunzip 1million.rdf.gz
[root@swarm-manager-1 ~]# grep -P "\t\t<" 1million.rdf | head -n 100000  > 100k.rdf ; gzip 100k.rdf # create a subset to load the data faster, no empty lines or comments
[root@swarm-manager-1 ~]# cp 100k.rdf.gz /var/lib/docker/volumes/dgraph_data/_data/
[root@swarm-manager-1 ~]# SERVER_ID=`docker ps | grep server | awk '{print $1}'`
[root@swarm-manager-1 ~]# docker exec $SERVER_ID dgraph live -r 100k.rdf.gz --zero=zero-1:5080

Processing 100k.rdf.gz
Total Txns done:        0 RDFs per second:       0 Time Elapsed: 2s, Aborts: 0
Total Txns done:        0 RDFs per second:       0 Time Elapsed: 4s, Aborts: 0
Total Txns done:        0 RDFs per second:       0 Time Elapsed: 6s, Aborts: 0
Total Txns done:        0 RDFs per second:       0 Time Elapsed: 8s, Aborts: 0
Total Txns done:        0 RDFs per second:       0 Time Elapsed: 10s, Aborts: 0
Total Txns done:        0 RDFs per second:    2499 Time Elapsed: 12s, Aborts: 0
Total Txns done:        2 RDFs per second:    3570 Time Elapsed: 14s, Aborts: 0
Total Txns done:        3 RDFs per second:    3749 Time Elapsed: 16s, Aborts: 0
Total Txns done:        4 RDFs per second:    3888 Time Elapsed: 18s, Aborts: 0
Total Txns done:        6 RDFs per second:    4499 Time Elapsed: 20s, Aborts: 0
Total Txns done:        7 RDFs per second:    4544 Time Elapsed: 22s, Aborts: 0
Total Txns done:        7 RDFs per second:    4166 Time Elapsed: 24s, Aborts: 0
Total Txns done:        9 RDFs per second:    3846 Time Elapsed: 26s, Aborts: 0
Number of mutations run   : 10                                                                      
Number of RDFs processed  : 100000
Time spent                : 26.613957969s
RDFs processed per second : 3846
Total Txns done:       10 RDFs per second:    3571 Time Elapsed: 28s, Aborts: 0
[root@swarm-manager-1 ~]# gunzip 100k.rdf.gz 
[root@swarm-manager-1 ~]# wc -l 100k.rdf 
100000 100k.rdf
# install curl in the server container at this point
[root@swarm-manager-1 ~]# docker exec $SERVER_ID curl "localhost:8080/admin/export"
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:--  0:00:02 --:--:--     0{"code": "Success", "message": "Export comple100    51  100    51    0     0     18      0  0:00:02  0:00:02 --:--:--    18
# find out the leader of the cluster, check if export files are there
# in this case the leader is swarm-manager-3
[root@swarm-manager-3 export]# cd /var/lib/docker/volumes/dgraph_data/_data/export/
[root@swarm-manager-3 export]# ls -lha
total 976K
drwx------ 2 root root   89 Jan 31 16:10 .
drwxr-xr-x 6 root root   48 Jan 31 16:10 ..
-rw-r--r-- 1 root root 971K Jan 31 16:10 dgraph-1-2018-01-31-13-10.rdf.gz
-rw-r--r-- 1 root root  125 Jan 31 16:10 dgraph-1-2018-01-31-13-10.schema.gz
[root@swarm-manager-3 export]# gunzip dgraph-1-2018-01-31-13-10.rdf.gz 
[root@swarm-manager-3 export]# wc -l dgraph-1-2018-01-31-13-10.rdf 
99780 dgraph-1-2018-01-31-13-10.rdf

220 of the 100,000 submitted entries are missing. I was able to reproduce the issue three times in a row, but on the fourth try the export was complete. Each try was run with empty volumes (removed via docker volume rm dgraph_data), and the stack was removed and deployed again. I wasn't able to figure out which data is missing because of the differences between input and output, but the amount of missing data is always the same. Adding a schema after the live loader run doesn't change anything; data is still missing. Alter and export were performed about 10 minutes after import.

# endpoint /alter, modified via ratel

director.film: uid @reverse @count .
genre: uid @reverse .
initial_release_date: dateTime @index(year) .
name: string @index(term) .
starring: uid @count .

....

# perform the export, check the results
[root@swarm-manager-1 export]# wc -l dgraph-1-2018-01-31-13-48.rdf 
99780 dgraph-1-2018-01-31-13-48.rdf
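Since the input and the export differ in format, a line-by-line diff is unlikely to show which triples went missing. A rough alternative is to compare triple counts per predicate. The sketch below assumes that the predicate is the second whitespace-separated field on every line of both files; `missing_by_predicate` is a hypothetical helper name, not part of Dgraph.

```shell
# Print "<predicate> <shortfall>" for every predicate that has a different
# number of triples in the export than in the input.
# Assumption: the predicate is the second whitespace-separated field.
missing_by_predicate() {
  awk '
    NF >= 3 && FNR == NR { loaded[$2]++; next }   # first file: the input
    NF >= 3              { exported[$2]++ }       # second file: the export
    END {
      for (p in loaded)
        if (loaded[p] != exported[p] + 0)
          print p, loaded[p] - exported[p]
    }
  ' "$1" "$2" | sort
}

# Usage with the file names from the steps above:
#   missing_by_predicate 100k.rdf dgraph-1-2018-01-31-13-48.rdf
```

If the 220 missing triples all belong to one or two predicates, that would narrow down where to look.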

unknown321 on 31 Jan 2018

+1, the issue is not solved.
I still hit the predicate undefined error after predicates are moved.
@pawanrawal might consider reopening this issue.

jzhu077 on 31 Jan 2018

@unknown321 While performing an export, we go over the data that's on disk and all keys might not have made it to disk. So if you restarted the server (which flushes the keys to disk), your export should return the correct count. I'll change the implementation to use a new iterator which combines the keys in memory with those on disk. I have created another issue for this https://github.com/dgraph-io/dgraph/issues/2070.
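To check whether a restart really closes the gap, one could run the export again after the restart and recount its lines against the number of triples that were loaded. This is only a sketch: `export_shortfall` is a hypothetical helper, and it assumes one triple per line in the gzipped export files.

```shell
# Print how many triples are missing from one or more gzipped export files,
# given the number of triples that were loaded.
# Assumption: every line of the export holds exactly one triple.
export_shortfall() {
  expected=$1
  shift
  actual=$(gunzip -c "$@" | wc -l)
  echo $((expected - actual))
}

# Usage with the numbers from the report above (the path is the one used there):
#   export_shortfall 100000 /var/lib/docker/volumes/dgraph_data/_data/export/dgraph-1-*.rdf.gz
```

A result of 0 after the restart would be consistent with the explanation that some keys had not yet reached disk.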

@jzhu077 I can look at the code and try to see what's going on. Any more pointers that you can provide would be helpful (some code to reproduce it, maybe).

pawanrawal on 1 Feb 2018

@pawanrawal I am still having the same issue. Basically, I set up the Dgraph cluster in Kubernetes with 5 zeros and 30 servers (dgraph/dgraph:master) divided into 10 groups, inserted some data, altered the schema and ran some queries, all working as expected. I left it running as it was for a day, and the next day, when I tried to run the same query that was working a day earlier, I got predicate foobar undefined. I have tested this over quite a few days and I get the same error every day. It doesn't always occur straight after the predicate move, but if you leave it long enough (a day) it will.

jzhu077 on 5 Feb 2018

Keen to resolve this as soon as possible. I would need the answers to some questions.

Could you share the exact full error that you get? Also, do you see the predicate when you do the schema{} query or in the tablets section at http://localhost:6080/state? Do you know if it only happens for predicates that were moved or for other predicates as well? If you try the query multiple times (3-4 times), do you always get this error once it starts occurring?
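The tablets check can be done from the command line. The sketch below only tests whether the predicate name appears as a quoted string anywhere in the JSON returned by Zero; it makes no assumption about the exact layout of that JSON, so a match is a hint rather than proof. `state_mentions_predicate` is a hypothetical helper name.

```shell
# Succeeds if the predicate name appears as a quoted string in the
# Zero state JSON read from stdin.
# Assumption: a tablet is referenced by its predicate name somewhere in the JSON.
state_mentions_predicate() {
  grep -qF "\"$1\""
}

# Usage against the Zero port mentioned above:
#   curl -s http://localhost:6080/state | state_mentions_predicate ___kind \
#     && echo "listed" || echo "not listed"
```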

pawanrawal on 5 Feb 2018

Could you share the exact full error that you get?

It's in one of these two formats: rpc error: code = Unknown desc = Got error: Schema not defined for predicate: ___kind. while running: name:"eq" args:"foobar" or Predicate ___older_sibling doesn't have reverse edge

do you see the predicate when you do the schema{} query

Yes,

    {
        "predicate": "___kind",
        "type": "string",
        "index": true,
        "tokenizer": [
          "exact"
        ]
      },
      {
        "predicate": "___older_sibling",
        "type": "uid",
        "reverse": true
      },

Do you know if it only happens for predicates that were moved or for other predicates as well?

Not sure, the log doesn't tell which predicate is being moved.

2018/02/02 15:31:44 predicate_move.go:52: Writing 328490 keys
2018/02/02 15:31:47 predicate_move.go:52: Writing 20866 keys
2018/02/02 15:32:27 wal.go:118: Writing snapshot to WAL, metadata: {ConfState:{Nodes:[10 11 12] XXX_unrecognized:[]} Index:9003 Term:2 XXX_unrecognized:[]}, len(data): 73

If you try the query multiple times (3-4 times), do you always get this error once it starts occurring?

Yes, I always get one of the errors as I stated in the answer to the first question.

jzhu077 on 5 Feb 2018

@pawanrawal are you able to reproduce the issue?

jzhu077 on 6 Feb 2018

I will give it a try today.

Not sure, the log doesn't tell which predicate is being moved.

Logs for Zero leader should tell you this.

pawanrawal on 6 Feb 2018

@jzhu077 I was able to reproduce what I think is the issue. After predicate move, the schema is somehow not updated on replicas of a group and is only updated on the leader. A query to the follower in a group returns an error but works fine when sent to the leader. Working on a fix.
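A quick way to see the leader/follower difference is to send the same query to every server directly and look for the error text reported earlier in this thread. This is a sketch under assumptions: it posts the query to each server's `/query` HTTP endpoint, the helper names are hypothetical, and the addresses and query in the usage line are examples only.

```shell
# Succeeds if the response on stdin contains the schema error from this thread.
has_schema_error() {
  grep -q 'Schema not defined for predicate'
}

# Send one query to each server address and report which ones return the error.
probe_servers() {
  query=$1
  shift
  for addr in "$@"; do
    if ! resp=$(curl -s -X POST -d "$query" "http://$addr/query"); then
      echo "$addr: unreachable"
    elif printf '%s\n' "$resp" | has_schema_error; then
      echo "$addr: schema error"
    else
      echo "$addr: ok"
    fi
  done
}

# Usage (example addresses and query):
#   probe_servers '{ q(func: eq(___kind, "foobar")) { uid } }' \
#     localhost:8080 localhost:8081 localhost:8082
```

If only some servers report the error, that matches the picture of a schema that was updated on the leader but not on the followers.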

pawanrawal on 7 Feb 2018

Have a fix in https://github.com/dgraph-io/dgraph/pull/2096 which I will get merged today and update here. Does the query work fine sometimes and not at other times?

pawanrawal on 7 Feb 2018

The nightly build has been updated. Can you pull the latest dgraph/dgraph:master image and try again @jzhu077?

pawanrawal on 7 Feb 2018

@pawanrawal The query always failed, as it was missing either the first or the second predicate in the query I ran. Sorry, I haven't tested a single-block query, which might work sometimes if what you found is correct. I will give it a try today. Thanks for the timely update :+1:

jzhu077 on 7 Feb 2018

What's the imagePullPolicy setting in your config file? I am wondering whether all the nodes are actually using the latest master image. There could still be an issue with the schema, but I just want to make sure you are using the right image.

pawanrawal on 7 Feb 2018

I set it to always, so it always pulls the latest image. Looks like the issue has been fixed.

jzhu077 on 9 Feb 2018

Cool, I am closing this. Feel free to reopen it if you encounter the issue again.

pawanrawal on 12 Feb 2018
