Loaded 21 million RDFs using the bulk loader with reduce_shards set to 3, then started up three Dgraph instances. The export count is correct initially but starts going down after a predicate move has happened.
This was caused by three different small issues which have been fixed in 7131e04a3090f389868a7f938fcb6c4954e21f31, 6da3c7d73960ad99d227c81e60d027590344c962 and 85af112124dfe2c582f13065cc18dab16bbaf094.
I think the issue is not resolved.
Dgraph version : v1.0.2-dev
Commit SHA-1 : 2fade242
Commit timestamp : 2018-01-31 23:07:30 +1100
Branch : HEAD
Cluster configuration: Docker swarm, 3 server nodes, 3 zero nodes, replication x3. docker-compose.yml:
version: "3.4"
services:
  zero-1:
    image: dgraph/dgraph:master
    hostname: "zero-1"
    command: dgraph zero --my=zero-1:5080 --replicas 3 --idx 1
    volumes:
      - data:/dgraph
    networks:
      - dgraph
    ports:
      - 6080:6080
    deploy:
      placement:
        constraints:
          - node.hostname == swarm-manager-1
  zero-2:
    image: dgraph/dgraph:master
    hostname: "zero-2"
    command: dgraph zero --my=zero-2:5080 --replicas 3 --idx 2 --peer zero-1:5080
    volumes:
      - data:/dgraph
    networks:
      - dgraph
    deploy:
      placement:
        constraints:
          - node.hostname == swarm-manager-2
  zero-3:
    image: dgraph/dgraph:master
    hostname: "zero-3"
    command: dgraph zero --my=zero-3:5080 --replicas 3 --idx 3 --peer zero-1:5080
    volumes:
      - data:/dgraph
    networks:
      - dgraph
    deploy:
      placement:
        constraints:
          - node.hostname == swarm-manager-3
  server-1:
    image: dgraph/dgraph:master
    hostname: "server-1"
    command: dgraph server --my=server-1:7080 --memory_mb=1568 --zero=zero-1:5080 --export=/dgraph/export
    volumes:
      - data:/dgraph
    networks:
      - dgraph
    ports:
      - 8080:8080
    deploy:
      replicas: 1
      placement:
        constraints:
          - node.hostname == swarm-manager-1
  server-2:
    image: dgraph/dgraph:master
    hostname: "server-2"
    command: dgraph server --my=server-2:7080 --memory_mb=1568 --zero=zero-1:5080 --export=/dgraph/export
    volumes:
      - data:/dgraph
    networks:
      - dgraph
    ports:
      - 8081:8080
    deploy:
      replicas: 1
      placement:
        constraints:
          - node.hostname == swarm-manager-2
  server-3:
    image: dgraph/dgraph:master
    hostname: "server-3"
    command: dgraph server --my=server-3:7080 --memory_mb=1568 --zero=zero-1:5080 --export=/dgraph/export
    volumes:
      - data:/dgraph
    ports:
      - 8082:8080
    networks:
      - dgraph
    deploy:
      replicas: 1
      placement:
        constraints:
          - node.hostname == swarm-manager-3
  ratel:
    image: dgraph/dgraph:master
    command: dgraph-ratel
    networks:
      - dgraph
    ports:
      - 18049:8081
networks:
  dgraph:
    external: true
volumes:
  data:
Steps to reproduce:
[root@swarm-manager-1 ~]# docker stack deploy dgraph --compose-file=/root/dgraph/docker-compose-master.yml
[root@swarm-manager-1 ~]# wget "https://github.com/dgraph-io/tutorial/blob/master/resources/1million.rdf.gz?raw=true" -O 1million.rdf.gz -q ; gunzip 1million.rdf.gz
[root@swarm-manager-1 ~]# grep -P "\t\t<" 1million.rdf | head -n 100000 > 100k.rdf ; gzip 100k.rdf # create a subset to load the data faster, no empty lines or comments
[root@swarm-manager-1 ~]# cp 100k.rdf.gz /var/lib/docker/volumes/dgraph_data/_data/
[root@swarm-manager-1 ~]# SERVER_ID=`docker ps | grep server | awk '{print $1}'`
[root@swarm-manager-1 ~]# docker exec $SERVER_ID dgraph live -r 100k.rdf.gz --zero=zero-1:5080
Processing 100k.rdf.gz
Total Txns done: 0 RDFs per second: 0 Time Elapsed: 2s, Aborts: 0
Total Txns done: 0 RDFs per second: 0 Time Elapsed: 4s, Aborts: 0
Total Txns done: 0 RDFs per second: 0 Time Elapsed: 6s, Aborts: 0
Total Txns done: 0 RDFs per second: 0 Time Elapsed: 8s, Aborts: 0
Total Txns done: 0 RDFs per second: 0 Time Elapsed: 10s, Aborts: 0
Total Txns done: 0 RDFs per second: 2499 Time Elapsed: 12s, Aborts: 0
Total Txns done: 2 RDFs per second: 3570 Time Elapsed: 14s, Aborts: 0
Total Txns done: 3 RDFs per second: 3749 Time Elapsed: 16s, Aborts: 0
Total Txns done: 4 RDFs per second: 3888 Time Elapsed: 18s, Aborts: 0
Total Txns done: 6 RDFs per second: 4499 Time Elapsed: 20s, Aborts: 0
Total Txns done: 7 RDFs per second: 4544 Time Elapsed: 22s, Aborts: 0
Total Txns done: 7 RDFs per second: 4166 Time Elapsed: 24s, Aborts: 0
Total Txns done: 9 RDFs per second: 3846 Time Elapsed: 26s, Aborts: 0
Number of mutations run : 10
Number of RDFs processed : 100000
Time spent : 26.613957969s
RDFs processed per second : 3846
Total Txns done: 10 RDFs per second: 3571 Time Elapsed: 28s, Aborts: 0
[root@swarm-manager-1 ~]# gunzip 100k.rdf.gz
[root@swarm-manager-1 ~]# wc -l 100k.rdf
100000 100k.rdf
# install curl in the server container at this point
[root@swarm-manager-1 ~]# docker exec $SERVER_ID curl "localhost:8080/admin/export"
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- 0:00:02 --:--:-- 0{"code": "Success", "message": "Export comple100 51 100 51 0 0 18 0 0:00:02 0:00:02 --:--:-- 18
# find out the leader of the cluster, check if export files are there
# in this case the leader is swarm-manager-3
[root@swarm-manager-3 export]# cd /var/lib/docker/volumes/dgraph_data/_data/export/
[root@swarm-manager-3 export]# ls -lha
total 976K
drwx------ 2 root root 89 Jan 31 16:10 .
drwxr-xr-x 6 root root 48 Jan 31 16:10 ..
-rw-r--r-- 1 root root 971K Jan 31 16:10 dgraph-1-2018-01-31-13-10.rdf.gz
-rw-r--r-- 1 root root 125 Jan 31 16:10 dgraph-1-2018-01-31-13-10.schema.gz
[root@swarm-manager-3 export]# gunzip dgraph-1-2018-01-31-13-10.rdf.gz
[root@swarm-manager-3 export]# wc -l dgraph-1-2018-01-31-13-10.rdf
99780 dgraph-1-2018-01-31-13-10.rdf
220 entries are missing out of the 100,000 submitted. I was able to reproduce the issue three times in a row, but on the fourth try the export was complete. Each try was run with no data in the volumes (removed via docker volume rm dgraph_data), and the stack was removed and deployed again. I wasn't able to figure out which data is missing because of the differences between the input and output formats, but the amount of missing data is always the same. Adding a schema after the live loader run doesn't change anything; data is still missing. Alter and export were performed about 10 minutes after the import.
# endpoint /alter, modified via ratel
director.film: uid @reverse @count .
genre: uid @reverse .
initial_release_date: dateTime @index(year) .
name: string @index(term) .
starring: uid @count .
....
# perform the export, check the results
[root@swarm-manager-1 export]# wc -l dgraph-1-2018-01-31-13-48.rdf
99780 dgraph-1-2018-01-31-13-48.rdf
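Since subject identifiers differ between the loaded file and the export, a line-by-line diff is of little use, but predicate names should survive the round trip. Below is a rough sketch that counts triples per predicate in both files and reports where the export falls short. It assumes one triple per line with the predicate as the second whitespace-separated token, and that the export spells predicates the same way as the input; the function names are made up for illustration.

```python
import gzip
from collections import Counter


def predicate_counts(path):
    """Count triples per predicate in a plain or gzipped RDF file."""
    opener = gzip.open if path.endswith(".gz") else open
    counts = Counter()
    with opener(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            fields = line.split()
            # Skip blank lines, comments, and anything too short to be a triple.
            if len(fields) < 3 or fields[0].startswith("#"):
                continue
            # Assumption: the predicate is the second whitespace-separated token.
            counts[fields[1]] += 1
    return counts


def missing_by_predicate(loaded_path, exported_path):
    """Return {predicate: shortfall} for predicates with fewer exported triples."""
    loaded = predicate_counts(loaded_path)
    exported = predicate_counts(exported_path)
    return {
        pred: loaded[pred] - exported[pred]
        for pred in loaded
        if loaded[pred] > exported[pred]
    }
```

Calling missing_by_predicate("100k.rdf", "dgraph-1-2018-01-31-13-48.rdf") would then show whether the 220 missing lines are concentrated in a single predicate or spread across several.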
+1 to issue not solved.
I still hit the predicate undefined error after predicates are moved.
@pawanrawal might consider reopening this issue.
@unknown321 While performing an export, we go over the data that's on disk and all keys might not have made it to disk. So if you restarted the server (which flushes the keys to disk), your export should return the correct count. I'll change the implementation to use a new iterator which combines the keys in memory with those on disk. I have created another issue for this https://github.com/dgraph-io/dgraph/issues/2070.
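To illustrate the idea of an iterator that combines in-memory keys with those on disk, here is a minimal sketch. It is not Dgraph's implementation: it assumes both sources are already sorted by key, that keys are unique within each source, and that the in-memory version of a key is the newer one. The name merged_items is hypothetical.

```python
import heapq


def merged_items(memory, disk):
    """Yield (key, value) pairs in key order from two sorted sources.

    `memory` and `disk` are iterables of (key, value) pairs, each sorted by
    key. When a key appears in both, the in-memory value wins.
    """
    # Tag each pair with a priority so memory (0) sorts before disk (1)
    # whenever the keys are equal.
    tagged_memory = ((key, 0, value) for key, value in memory)
    tagged_disk = ((key, 1, value) for key, value in disk)
    merged = heapq.merge(tagged_memory, tagged_disk, key=lambda t: (t[0], t[1]))

    last_key = object()  # sentinel that equals no real key
    for key, _, value in merged:
        if key == last_key:
            continue  # the newer in-memory version was already emitted
        last_key = key
        yield key, value
```

An export that walks only the disk source would miss every key that exists only in memory, which matches the shortfall reported above.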
@jzhu077 I can look at the code and try to see what's going on. Any more pointers that you can provide would be helpful (some code to reproduce it maybe).
@pawanrawal I'm still having the same issue. Basically, I set up a Dgraph cluster in Kubernetes with 5 zeros and 30 servers (dgraph/dgraph:master) divided into 10 groups, inserted some data, altered the schema, and ran some queries, all working as expected. I left it running as it was for a day. The next day, when I ran the same query that had worked a day earlier, I got "predicate foobar undefined". I have tested this over quite a few days and I get the same error every day. It doesn't always occur straight after the predicate move, but if you leave it long enough (a day) it will.
Keen to resolve this as soon as possible. I would need the answers to some questions.
Could you share the exact full error that you get? Also, do you see the predicate when you do the schema{} query or in the tablets section at http://localhost:6080/state? Do you know if it only happens for predicates that were moved or for other predicates as well? If you try the query multiple times (3-4 times), do you always get this error once it starts occurring?
Could you share the exact full error that you get?
It's in one of these two formats: rpc error: code = Unknown desc = Got error: Schema not defined for predicate: ___kind. while running: name:"eq" args:"foobar" or Predicate ___older_sibling doesn't have reverse edge
do you see the predicate when you do the schema{} query
Yes,
{
"predicate": "___kind",
"type": "string",
"index": true,
"tokenizer": [
"exact"
]
},
{
"predicate": "___older_sibling",
"type": "uid",
"reverse": true
},
Do you know if it only happens for predicates that were moved or for other predicates as well?
Not sure, the log doesn't tell which predicate is being moved.
2018/02/02 15:31:44 predicate_move.go:52: Writing 328490 keys
2018/02/02 15:31:47 predicate_move.go:52: Writing 20866 keys
2018/02/02 15:32:27 wal.go:118: Writing snapshot to WAL, metadata: {ConfState:{Nodes:[10 11 12] XXX_unrecognized:[]} Index:9003 Term:2 XXX_unrecognized:[]}, len(data): 73
If you try the query multiple times (3-4 times), do you always get this error once it starts occurring?
Yes, I always get one of the errors as I stated in the answer to the first question.
@pawanrawal are you able to reproduce the issue?
I will give it a try today.
Not sure, the log doesn't tell which predicate is being moved.
Logs for Zero leader should tell you this.
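Another way to see which predicates moved, besides the Zero leader logs, is to compare two snapshots of the /state output taken before and after. The sketch below works on already-parsed JSON and assumes a shape of groups -> group id -> tablets -> predicate name; that shape is an assumption here and should be checked against the actual response. Both function names are hypothetical.

```python
def tablet_owners(state):
    """Return {predicate: group id} from a parsed /state response.

    Assumed shape: state["groups"][group_id]["tablets"][predicate].
    """
    owners = {}
    for group_id, group in state.get("groups", {}).items():
        for predicate in group.get("tablets") or {}:
            owners[predicate] = group_id
    return owners


def moved_predicates(before, after):
    """Return {predicate: (old group, new group)} for tablets that changed group."""
    old, new = tablet_owners(before), tablet_owners(after)
    return {
        pred: (old[pred], new[pred])
        for pred in old
        if pred in new and old[pred] != new[pred]
    }
```

If a predicate that raises the undefined error shows up in the result, that would point to the move as the trigger.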
@jzhu077 I was able to reproduce what I think is the issue. After predicate move, the schema is somehow not updated on replicas of a group and is only updated on the leader. A query to the follower in a group returns an error but works fine when sent to the leader. Working on a fix.
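One way to confirm this symptom is to run the schema{} query against every member of a group and compare the results. The sketch below takes the already-fetched schema entries per node (each a dict with a "predicate" key, as in the schema output shown earlier in this thread) and reports which nodes lack which predicates. How the entries are fetched is left out, and the function name and node names are placeholders.

```python
def missing_predicates(schemas_by_node):
    """Map each node to the sorted predicates it lacks relative to the union.

    `schemas_by_node` maps a node name to its list of schema entries, each a
    dict with at least a "predicate" key.
    """
    known = {
        node: {entry["predicate"] for entry in entries}
        for node, entries in schemas_by_node.items()
    }
    # Every predicate seen on at least one node.
    everything = set().union(*known.values())
    return {
        node: sorted(everything - preds)
        for node, preds in known.items()
        if everything - preds
    }
```

A non-empty result for a follower, with the leader absent from the result, would be consistent with the schema being updated only on the leader after the move.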
I have a fix in https://github.com/dgraph-io/dgraph/pull/2096, which I will get merged today, and I'll update here. Does the query work fine sometimes and not at other times?
The nightly build has been updated. Can you pull the latest dgraph/dgraph:master image and try again @jzhu077?
@pawanrawal The query always failed, as it was missing either the first or the second predicate in the query I ran. Sorry, I haven't tested a single-block query, which might work sometimes if what you found is correct. I will give it a try today. Thanks for the timely update :+1:
What's your imagePullPolicy setting in your config file? I am wondering if all the nodes are actually using the latest master image. There could still be an issue with the schema, but I just want to make sure you are using the right image.
I set it to always, so it always pulls the latest image. Looks like the issue has been fixed.
Cool, I am closing this. Feel free to reopen it if you encounter the issue again.