Dgraph: Unable to load data fast enough

Created on 29 Aug 2017 · 6 Comments · Source: dgraph-io/dgraph

We are really interested in finding a graph solution that has what Dgraph offers: scalability and low latency. In addition we're also a Golang shop, so Dgraph has that extra appeal to us.

So we decided to give it a shot by loading a section of our graph to get an idea of how performant it is. But we haven't been able to load data into it fast enough. We're seeing issues similar to those reported in https://github.com/dgraph-io/dgraph/issues/1323.

I first attempted to just load the N-Quads as is, without any pre-processing, but given the slowness I decided to break down the problem in order to make it easier for Dgraph (at least that's what I thought).

Input Data

  • _Subject_ and _object_ are uids.
  • File is sorted by _subject_ then _object_.
  • Only one _predicate_.
  • Schema only has the definition of indexes related to this _predicate_.
  • Total RDFs = 107,467,724
  • Sample:
<0x3e8> <has> <0x4e6a18f> .
<0x3e8> <has> <0x4e81544> .
<0x3e8> <has> <0x4e8288e> .
<0x3e9> <has> <0x4e6bdc6> .
<0x3e9> <has> <0x4e747b3> .
<0x3e9> <has> <0x4e7700a> .
<0x3ea> <has> <0x4e77ad9> .
<0x3eb> <has> <0x4e6f0e3> .
<0x3ec> <has> <0x4e6bdd6> .
<0x3ed> <has> <0x4e78008> .
  • Schema mutation:
mutation {
  schema {
    has: uid @reverse @count .
  }
}
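
For anyone reproducing this input, here is a minimal sketch of the pre-processing described above: turning (subject, object) uid pairs into N-Quads sorted by subject and then object. The function name `to_nquads` and the in-memory list of pairs are assumptions for illustration only; a file of this size would need an external sort rather than `sorted()`.

```python
def to_nquads(edges, predicate="has"):
    """Yield one N-Quad line per (subject_uid, object_uid) integer pair.

    Pairs are emitted sorted by subject and then by object, with uids
    written in the same hex form as the sample above.
    """
    for subj, obj in sorted(edges):
        yield f"<{subj:#x}> <{predicate}> <{obj:#x}> ."


# Hypothetical unsorted input using a few uids from the sample.
edges = [(0x3E9, 0x4E6BDC6), (0x3E8, 0x4E81544), (0x3E8, 0x4E6A18F)]
lines = list(to_nquads(edges))
for line in lines:
    print(line)
```

The output can then be gzipped and passed to dgraphloader as in the command shown further down.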

As a side note, if I load this without the schema, it loads at least 8 times faster, which is around the speed we're looking for. But when I run the mutation to alter the schema after all the data is loaded, Dgraph dies. I don't think this is the recommended way of doing it, since dgraphloader also creates the schema beforehand, but I might be wrong.

As I mentioned before, this is just a small test, we need to be able to load billions of nodes and billions of edges with different predicates at some point.

Environment:

Hardware: We are using a single AWS i3.xlarge instance, which comes with an NVMe SSD drive. We are loading the data from a separate instance in the same region.
Software: Both dgraph and dgraphloader are on commit: d14bb29e (last night's build). I'm not using the containerized docker version, I'm running the binaries directly on the host.

Problem:

Load decelerates.

$ ./dgraphloader -d=10.0.30.183:9080 -r=import/by_edge/has/dgraph_nquads_000.gz 2>dgraphloader_errors.log;

Dgraph version   : v0.8.1-dev
Commit SHA-1     : d14bb29e
Commit timestamp : 2017-08-28 11:50:19 +1000
Branch           : HEAD


Processing import/by_edge/has/dgraph_nquads_000.gz
[Request:    426] Total RDFs done:   326000 RDFs per second:   32590 Time Elapsed: 10s
[Request:    955] Total RDFs done:   855000 RDFs per second:   28497 Time Elapsed: 30s
[Request:   1547] Total RDFs done:  1447000 RDFs per second:   24115 Time Elapsed: 1m0s
[Request:   2110] Total RDFs done:  2010000 RDFs per second:   22333 Time Elapsed: 1m30s
[Request:   2689] Total RDFs done:  2589000 RDFs per second:   21574 Time Elapsed: 2m0s
[Request:   3247] Total RDFs done:  3147000 RDFs per second:   20980 Time Elapsed: 2m30s
[Request:   3668] Total RDFs done:  3568000 RDFs per second:   19822 Time Elapsed: 3m0s
[Request:   4057] Total RDFs done:  3957000 RDFs per second:   18843 Time Elapsed: 3m30s
[Request:   4432] Total RDFs done:  4332000 RDFs per second:   18050 Time Elapsed: 4m0s
[Request:   4926] Total RDFs done:  4826000 RDFs per second:   17874 Time Elapsed: 4m30s
[Request:   5327] Total RDFs done:  5227000 RDFs per second:   17423 Time Elapsed: 5m0s
[Request:  33424] Total RDFs done: 33324000 RDFs per second:   11101 Time Elapsed: 50m0s
[Request:  35001] Total RDFs done: 34901000 RDFs per second:    7587 Time Elapsed: 1h0m0s
// ... it is still running as I'm typing this, but in other attempts it has finished at a 2k throughput
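
The "RDFs per second" column in this output looks like an average over the whole run, so it understates how slow the load has become. A rough sketch of computing the rate between consecutive progress lines instead is below. The helper names (`parse_elapsed`, `interval_rates`) are made up for this example, and the parser only assumes the line layout visible in the log above.

```python
import re

# Matches the progress lines printed by the loader in the log above.
LINE_RE = re.compile(r"Total RDFs done:\s*(\d+).*Time Elapsed:\s*(\S+)")
# Elapsed time is printed like "10s", "1m30s" or "1h0m0s".
DUR_RE = re.compile(r"(\d+)([hms])")
SECONDS = {"h": 3600, "m": 60, "s": 1}


def parse_elapsed(text):
    """Convert an elapsed string such as '1h0m0s' to seconds."""
    return sum(int(n) * SECONDS[unit] for n, unit in DUR_RE.findall(text))


def interval_rates(log_lines):
    """Return (elapsed_seconds, RDFs/s since the previous line) pairs."""
    points = []
    for line in log_lines:
        match = LINE_RE.search(line)
        if match:
            points.append((parse_elapsed(match.group(2)), int(match.group(1))))
    return [
        (t1, (n1 - n0) / (t1 - t0))
        for (t0, n0), (t1, n1) in zip(points, points[1:])
    ]


log = [
    "[Request:    426] Total RDFs done:   326000 RDFs per second:   32590 Time Elapsed: 10s",
    "[Request:    955] Total RDFs done:   855000 RDFs per second:   28497 Time Elapsed: 30s",
    "[Request:  33424] Total RDFs done: 33324000 RDFs per second:   11101 Time Elapsed: 50m0s",
    "[Request:  35001] Total RDFs done: 34901000 RDFs per second:    7587 Time Elapsed: 1h0m0s",
]
rates = interval_rates(log)
for elapsed, rate in rates:
    print(f"{elapsed:>5}s  {rate:,.0f} RDFs/s")
```

On these four lines, the rate between 10s and 30s is 26,450 RDFs per second, while the rate between 50m and 1h is roughly 2,600 RDFs per second, which is in line with the 2k figure from earlier attempts.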

Debug Vars

{
"badger_blocked_puts_total": 0,
"badger_disk_reads_total": 47414940,
"badger_disk_writes_total": 2747929,
"badger_gets_total": 206473185,
"badger_lsm_bloom_hits_total": {"l0": 324584753, "l1": 112035115, "l2": 37146773},
"badger_lsm_level_gets_total": {"l0": 12235095, "l1": 27951535, "l2": 63260625},
"badger_lsm_size": {"/mnt/nvme/dgraph/p": 2156384627, "/mnt/nvme/dgraph/w": 0},
"badger_memtable_gets_total": 211502566,
"badger_puts_total": 122806791,
"badger_read_bytes": 102106120565,
"badger_vlog_size": {"/mnt/nvme/dgraph/p": 64576369406, "/mnt/nvme/dgraph/w": 785360673},
"badger_written_bytes": 80243867194,
"cmdline": ["dgraph","--bindall=true","--memory_mb=15000","--expand_edge=false","--ui=/home/rfernandez/dgraph_assets","--p=/mnt/nvme/dgraph/p","--w=/mnt/nvme/dgraph/w","--export=/mnt/nvme/dgraph/export"],
"dgraph_active_mutations_total": 100,
"dgraph_cache_hits_total": 361267338,
"dgraph_cache_miss_total": 110181151,
"dgraph_cache_race_total": 174717,
"dgraph_dirtymap_keys_total": 186267,
"dgraph_evicted_lists_total": 109609767,
"dgraph_goroutines_total": 993,
"dgraph_heap_idle_bytes": 6607896576,
"dgraph_lcache_capacity_bytes": 793189560,
"dgraph_lcache_keys_total": 386905,
"dgraph_lcache_size_bytes": 793188010,
"dgraph_max_list_bytes": 39452384,
"dgraph_max_list_length": 42508861,
"dgraph_memory_inuse_bytes": 12849504256,
"dgraph_num_queries_total": 109952,
"dgraph_pending_proposals_total": 100,
"dgraph_pending_queries_total": 100,
"dgraph_posting_reads_total": 110181151,
"dgraph_posting_writes_total": 88409991,
"dgraph_predicate_stats": {"c.lname": 292922252, "lname": 89255446, "r.lname": 89234091},
"dgraph_proc_memory_bytes": 19921903616,
"dgraph_read_bytes_total": 84559661357,
"dgraph_server_health_status": 1,
"dgraph_written_bytes_total": 70487946284,
"memstats": {"Alloc":8963250192,"TotalAlloc":2020920819360,"Sys":21364221840,"Lookups":1586,"Mallocs":13210040065,"Frees":13166182091,"HeapAlloc":8963250192,"HeapSys":20309409792,"HeapIdle":8938749952,"HeapInuse":11370659840,"HeapReleased":984285184,"HeapObjects":43857974,"StackInuse":33554432,"StackSys":33554432,"MSpanInuse":156645120,"MSpanSys":216399872,"MCacheInuse":153600,"MCacheSys":163840,"BuckHashSys":2528191,"GCSys":744185856,"OtherSys":57979857,"NextGC":10319325728,"LastGC":1503976353314899060,"PauseTotalNs":6567042619,"PauseNs":[450021,13702456,52200084,38309945,1104040,37943873,7596804,492608,18258016,4043979,1793351,1204981,601302,4625393,820716,3232886,682065,423625,5674024,441029,880748,504589,1870794,6822540,6016953,42954801,18205298,40862487,8125980,6317011,677771,587036,3422816,6222603,10171193,2746294,29327622,4222189,2224106,649396,1451712,4848965,1415604,31836090,49060618,225269656,14861457,106632666,17928821,51841096,36572963,876914,381600,670208,812400,560833,415268,436779,123558760,321161,644482,525605,503911,501292965,215970231,955573,384644,321009,495778,427255,347695,2361252,361273,338017,777888,316493,526614,526104,289900,310915,331833,16324485,551149,3399623,1427754,6304547,678314,3677840,2785488,39198312,89965521,12484554,15176568,85213366,1141284,3146899,2549493,18408451,29311473,6069551,4808902,4812430,4109350,8891661,4033860,25460346,4180864,21280078,5503286,36683517,30791656,1548801,691586,15907559,1728902,5460066,3690359,1903116,1036912,31843277,14388836,12661816,3805085,11961204,7803629,2112285,17005139,4873164,926390,60930562,9220439,7457572,965358,6022629,4717710,3623183,6689112,2630952,18430765,18606680,23771148,17688069,5913987,2732792,2743109,5368781,1776667,29330511,13860687,7098940,1144888,18418657,12570719,1396128,9226831,3824884,18946164,5571940,6280841,10357427,16304227,1115903,14079055,6442038,751713,78559767,42900525,41433124,82606933,18339993,34522561,13554888,34331090,2490092,1376431,6281435,14246295,26794001,2576222
7,17361969,26208048,12046849,2453174,2676467,4299434,7037433,12218639,3366570,2492992,6856344,5558513,3128149,38710138,36121085,3956189,718617,2400081,3197314,3346898,7735968,3499743,16101580,16948671,1905953,17448716,2163593,28039796,6993666,1034642,6579303,689898,10826103,1868740,5244359,7075791,2915045,10370688,14019471,13054728,2953108,1657719,9195997,1347175,64481123,40164080,151199261,1058637,2187480,22139066,15531626,41701160,3693008,4760096,6439572,23139599,9070885,4181916,1098088,4216378,1004303,3147223,4265295,16502275,13553068,3837264,2023744,1669261,752928,514392,12219442,2028179,653107,5170455,6177560,5541170,25416234],"PauseEnd":[1503974448423297586,1503974460047309179,1503974471995356483,1503974484120590253,1503974495775652633,1503974507410833063,1503974518797460311,1503974529656590602,1503974540631006574,1503974551785903191,1503974562927411804,1503974574077232155,1503974585294760271,1503974595575861789,1503974606206816896,1503974616597917168,1503974626921031484,1503974638030012334,1503974648427688424,1503974659040517183,1503974670166956869,1503974680883024960,1503974692407646063,1503974703016675964,1503974718406893885,1503974740725352780,1503974751901170037,1503974762935339841,1503974783217768654,1503974794745133832,1503974805516588082,1503974815882835268,1503974827049544275,1503974837631697094,1503974848384115002,1503974859689070554,1503974872043864043,1503974883739404531,1503974895222895150,1503974906947031025,1503974918526310091,1503974930170859876,1503974941919950681,1503974952514454953,1503974963139879431,1503974974456449102,1503974986267823164,1503974997211747260,1503975008483724927,1503975019647538367,1503975030583752181,1503975042768490746,1503975055435508193,1503975068652578151,1503975081736975650,1503975094514122379,1503975109155512134,1503975123587429086,1503975142550882163,1503975163568865438,1503975181108847683,1503975196851575036,1503975214769866516,1503975232318771537,1503975247125885586,1503975264426003323,1503975278312608311,15039752
95768813026,1503975316913460399,1503975341119174310,1503975358397806268,1503975374709518015,1503975402686350290,1503975425938854725,1503975445589433290,1503975472184194797,1503975496941997978,1503975517965641075,1503975539460262267,1503975563719865844,1503975598592908425,1503975621882308561,1503975632519790692,1503975642963339776,1503975655256524285,1503975668467641977,1503975682342942344,1503975697905853729,1503975717111047496,1503975737219208473,1503975758827690713,1503975779928858703,1503975801500050300,1503975822639168255,1503975844256084319,1503975866158131747,1503975889788025602,1503975912017380799,1503975934773245843,1503975955569036155,1503975985558233628,1503976012227463063,1503976035109484728,1503976071188019328,1503976094185113021,1503976117488904550,1503976146259337656,1503976175424190909,1503976196839646532,1503976216194693603,1503976233129004828,1503976246995619380,1503976265702329687,1503976282107292604,1503976299291870465,1503976317192804835,1503976335096478469,1503976353314899060,1503972515458680424,1503972534042809017,1503972553087991067,1503972572714567830,1503972591584715781,1503972610893737320,1503972630621788256,1503972652091311636,1503972675766235713,1503972699164480855,1503972719098368265,1503972738463787054,1503972757969998293,1503972775730455814,1503972794434144810,1503972812829983878,1503972831791841770,1503972850318894935,1503972867454471285,1503972884418234226,1503972900771161639,1503972917611729233,1503972934139226198,1503972952775497958,1503972967877334114,1503972983110166167,1503972998262073494,1503973013561998282,1503973028802554801,1503973042777078765,1503973055975492965,1503973069289552705,1503973082489630382,1503973096392633999,1503973110771511898,1503973125412773776,1503973139441823217,1503973153831967525,1503973167955652628,1503973182337836739,1503973195195404530,1503973208363727855,1503973222046640214,1503973235579928286,1503973249589242086,1503973263833836666,1503973278641952757,1503973291456968385,1503973303739560950,15039733
15763077656,1503973327327245694,1503973340829978186,1503973358131814774,1503973373612819496,1503973387978386963,1503973401871002550,1503973414423465587,1503973427535488902,1503973440878868827,1503973454713931605,1503973468130734066,1503973481834724806,1503973496137087791,1503973510554657088,1503973524568359607,1503973538638309998,1503973552947007176,1503973572188502062,1503973586390592928,1503973600419961095,1503973614157910459,1503973626997356709,1503973640128252555,1503973653628864546,1503973666996703181,1503973680410189997,1503973692368002488,1503973704073615548,1503973716127301261,1503973729441803843,1503973743494271769,1503973757399789255,1503973771654951935,1503973787953487443,1503973803356825055,1503973816693947162,1503973829518502128,1503973842278793150,1503973855514326535,1503973867988998080,1503973879856147233,1503973891816037955,1503973903533338038,1503973914684241889,1503973926179219733,1503973937596142509,1503973949226360228,1503973960560603012,1503973972215296900,1503973983802624691,1503973994982422638,1503974006057216837,1503974017205458893,1503974028297817157,1503974038787713736,1503974047995947967,1503974056955385764,1503974066017774577,1503974076500290457,1503974089940395970,1503974103604252606,1503974116185298663,1503974129149751964,1503974142155718963,1503974156176392119,1503974179125162233,1503974193491760044,1503974214155784495,1503974227455760760,1503974239316924577,1503974252166585316,1503974263787851351,1503974275128020211,1503974286364315166,1503974297601280809,1503974308934950421,1503974319864928077,1503974330818579266,1503974341537331221,1503974352161413772,1503974362836328847,1503974373615391931,1503974384543016264,1503974394994012091,1503974405963366266,1503974416669643377,1503974426591578852,1503974437207136294],"NumGC":630,"NumForcedGC":0,"GCCPUFraction":0.03451973733266252,"EnableGC":true,"DebugGC":false,"BySize":[{"Size":0,"Mallocs":0,"Frees":0},{"Size":8,"Mallocs":930065576,"Frees":925681556},{"Size":16,"Mallocs":4547986088,"Frees"
:4536828380},{"Size":32,"Mallocs":2185837067,"Frees":2177861362},{"Size":48,"Mallocs":569004663,"Frees":564390619},{"Size":64,"Mallocs":118343186,"Frees":118085027},{"Size":80,"Mallocs":224128818,"Frees":223141320},{"Size":96,"Mallocs":227449061,"Frees":226270322},{"Size":112,"Mallocs":2968016,"Frees":2957926},{"Size":128,"Mallocs":1681613703,"Frees":1671480768},{"Size":144,"Mallocs":212906593,"Frees":212732726},{"Size":160,"Mallocs":111674291,"Frees":111363595},{"Size":176,"Mallocs":1439595,"Frees":1433598},{"Size":192,"Mallocs":3377146,"Frees":3366499},{"Size":208,"Mallocs":111715246,"Frees":110652057},{"Size":224,"Mallocs":1017810,"Frees":1013474},{"Size":240,"Mallocs":763785,"Frees":760153},{"Size":256,"Mallocs":8556657,"Frees":8535054},{"Size":288,"Mallocs":1394539,"Frees":1388156},{"Size":320,"Mallocs":1550892,"Frees":1544606},{"Size":352,"Mallocs":1441131,"Frees":1343028},{"Size":384,"Mallocs":1314887,"Frees":1307760},{"Size":416,"Mallocs":914811,"Frees":883923},{"Size":448,"Mallocs":854525,"Frees":850680},{"Size":480,"Mallocs":882808,"Frees":879278},{"Size":512,"Mallocs":3594298,"Frees":3586984},{"Size":576,"Mallocs":2729475,"Frees":2723594},{"Size":640,"Mallocs":4256312,"Frees":4247365},{"Size":704,"Mallocs":2779821,"Frees":2766288},{"Size":768,"Mallocs":1063471,"Frees":1057549},{"Size":896,"Mallocs":1608893,"Frees":1602020},{"Size":1024,"Mallocs":2831672,"Frees":2823444},{"Size":1152,"Mallocs":2087278,"Frees":2084956},{"Size":1280,"Mallocs":3207878,"Frees":3202802},{"Size":1408,"Mallocs":1850876,"Frees":1842942},{"Size":1536,"Mallocs":1135105,"Frees":1128871},{"Size":1792,"Mallocs":1966287,"Frees":1958529},{"Size":2048,"Mallocs":556228971,"Frees":554962722},{"Size":2304,"Mallocs":3205679,"Frees":3198919},{"Size":2688,"Mallocs":3356650,"Frees":3348055},{"Size":3072,"Mallocs":627593,"Frees":625099},{"Size":3200,"Mallocs":316829,"Frees":315910},{"Size":3456,"Mallocs":638676,"Frees":636638},{"Size":4096,"Mallocs":1821722,"Frees":1817460},{"Size":4864,"Mallocs"
:3445546,"Frees":3441201},{"Size":5376,"Mallocs":707632,"Frees":705376},{"Size":6144,"Mallocs":447940,"Frees":445957},{"Size":6528,"Mallocs":176747,"Frees":176702},{"Size":6784,"Mallocs":226788,"Frees":226578},{"Size":6912,"Mallocs":93356,"Frees":93096},{"Size":8192,"Mallocs":1751882,"Frees":1748448},{"Size":9472,"Mallocs":1643160,"Frees":1641873},{"Size":9728,"Mallocs":157614,"Frees":157419},{"Size":10240,"Mallocs":297870,"Frees":297336},{"Size":10880,"Mallocs":345495,"Frees":344894},{"Size":12288,"Mallocs":212126,"Frees":210911},{"Size":13568,"Mallocs":171274,"Frees":171017},{"Size":14336,"Mallocs":124381,"Frees":124309},{"Size":16384,"Mallocs":958510,"Frees":957271},{"Size":18432,"Mallocs":693900,"Frees":693283},{"Size":19072,"Mallocs":167704,"Frees":167576}]}
}
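
A couple of ratios can be derived from the counters above. This is only a sketch with the values copied by hand from the dump. The interpretation (a full posting-list cache with a lot of churn) is a guess on my part, not something confirmed by the Dgraph team.

```python
# Values copied from the debug vars above.
debug_vars = {
    "dgraph_cache_hits_total": 361267338,
    "dgraph_cache_miss_total": 110181151,
    "dgraph_evicted_lists_total": 109609767,
    "dgraph_lcache_size_bytes": 793188010,
    "dgraph_lcache_capacity_bytes": 793189560,
}

hits = debug_vars["dgraph_cache_hits_total"]
misses = debug_vars["dgraph_cache_miss_total"]

# Fraction of posting-list lookups served from the cache.
cache_hit_ratio = hits / (hits + misses)

# How full the cache is relative to its configured capacity.
cache_fill = (
    debug_vars["dgraph_lcache_size_bytes"]
    / debug_vars["dgraph_lcache_capacity_bytes"]
)

# Evictions per miss: a value near 1 suggests most misses push another list out.
evictions_per_miss = debug_vars["dgraph_evicted_lists_total"] / misses

print(f"cache hit ratio:    {cache_hit_ratio:.3f}")
print(f"cache fill:         {cache_fill:.3f}")
print(f"evictions per miss: {evictions_per_miss:.3f}")
```

The hit ratio comes out at about 0.77, with the cache essentially at capacity and almost one eviction per miss.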

I owe you guys the Prometheus + Grafana stats; I still need to set that up.

Questions

  • What setup would provide the fastest throughput assuming we're able to pre-process the input data to some extent: break down, sort, define uids?
  • Would multi-node help?
  • Where do I need to scale in order to improve the throughput? We could move to another type of instance after the load finishes because our update cycle allows it. So is the load more CPU, memory or disk bound? Although you suggest AWS i3 for Badger, I didn't see disk being hammered too hard during the load.
  • What other settings can we play with?
  • Would defining the schema afterwards help (if it didn't kill the process)?

Thanks in advance for any help you can provide! We understand it isn't at v1.0 yet, but it is already a GREAT product and we're truly enjoying experimenting with it. But we need to overcome this first obstacle of getting a good amount of data in there. I'm curious about the details of your setup for loading the SO data: what did it look like, and how fast were you able to load it?


All 6 comments

Forgot to mention: another issue seen in previous load experiments was that disk usage grew too high during the load, but when I checked the next day it had gone down from 102GB to 12GB.

This was with only 4% of the small part of the graph we're currently trying to load, so the full load would require about 2.5TB of disk while it runs, even though it would settle at around 300GB afterwards.
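
The arithmetic behind that estimate, as a quick sketch. It assumes disk usage scales linearly with the amount of data loaded, which may not hold in practice.

```python
sample_percent = 4   # share of the data set loaded in the experiment
peak_gb = 102        # disk usage observed during the load
settled_gb = 12      # disk usage observed the next day

# Linear extrapolation from the 4% sample to the full data set.
projected_peak_gb = peak_gb * 100 / sample_percent
projected_settled_gb = settled_gb * 100 / sample_percent

print(f"peak during load: {projected_peak_gb:.0f} GB")
print(f"after settling:   {projected_settled_gb:.0f} GB")
```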

That is just something that caught my attention.

Hey @fervic,

Regarding performance, we're working on a bulk loader solution which would cut down on the network calls, write-ahead logs, raft, posting list rewrites, etc. We expect this solution to provide much better throughput for bulk loading purposes than the live Dgraph server. Expect that to come out in the next couple of weeks. It would help with your use case, and the other use cases which need to load up billions of edges into Dgraph.

What setup would provide the fastest throughput assuming we're able to pre-process the input data to some extent: break down, sort, define uids?

I think i3 with local SSD is good.

Would multi-node help?

Probably not for write throughput currently. What happens right now is that the dgraphloader generates batched mutations which would hit multiple servers in a multi-node cluster. This causes every query to block resources on all the respective servers, hence decreasing the write throughput. We need to make it so that each batch mutation can be handled entirely by just one server; so we can achieve better write throughput.

Where do I need to scale in order to improve the throughput? We could move to another type of instance after the load finishes because our update cycle allows it. So is the load more CPU, memory or disk bound? Although you suggest AWS i3 for Badger, I didn't see disk being hammered too hard during the load.

Badger, our internal KV store, is pretty lightweight in terms of disk writes. We recommend SSDs to benefit from the random reads that we need for better throughput.

Dgraph is definitely memory bound. The more memory it has, the faster it would be.

What other settings can we play with?

Not much in a single server setting.

Would defining the schema afterwards help (if it didn't kill the process)?

Nope. Define it upfront, that's the best.

@manishrjain thank you very much for your detailed response. I guess we just need to wait until this new tool is written. I would be more than happy to help early testing it.

@fervic I'm working on the tool at the moment. Some early testing against large data-sets such as yours would be extremely helpful for us. I'll let you know when I've got something ready to test.

Bulk loader is being released as part of v0.8.2.
