Previously it was believed that the --store_xids flag behaved the same as --xidmap. However, this is not true, and the --xidmap flag is still needed in Bulkloader.
Dgraph version : v2.0.0-beta1
Dgraph SHA-256 : 178663a98a3d59879a3d5c42928c89eb5f83afc2bfc0093272941e7a53515847
Commit SHA-1 : 6fac5d7c4
Commit timestamp : 2020-01-30 14:45:54 +1100
Branch : HEAD
Go version : go1.13.7
Yes.
Update: Ignore these steps and go to my last comment below.
dgraph bulk --store_xids -f ./agrovoc_2019-11-04_lod.nt --format=rdf -s schema.sch
dgraph bulk -x -f ./agrovoc_2019-11-04_lod.nt --format=rdf -s schema.sch
It should create an XID folder to be used in later imports via Live loader.
Origin of this issue: https://discuss.dgraph.io/t/bulk-loader-x-option/6115
Bulk loader's --store_xids flag is not the same as the live loader's --xidmap flag. If --store_xids is set, then the bulk loader writes out xid edges into Dgraph during the bulk load process.
The --store_xids flag isn't intended to write out an xid-uid mapping to a separate directory.
Okay, but what is the purpose? How do we use --store_xids?
Both loaders use very similar flags. If they don't behave the same way, this should be stated somewhere in the docs, and/or a different flag name should be used. An example of usage should also be documented.
Should be a simple change. Let's bring the two in sync.
Okay guys, after some time trying to identify the contexts, I will share my findings.
1 - In fact --store_xids and --xidmap are totally different things.
2 - There is no documentation or specific test for both* (actually there are tests for --xidmap, but not for --store_xids), so the similar names suggest that they behave in the same way. But that's not true.
Perhaps the tests have been lost over time, as this feature is very old and is related to RDFs.
It is necessary to update the documentation (I can do this) and also, in my opinion, to make both features available in both loaders. Having both features in the two tools would be very good for users.
3 - About the tools.
--xidmap:
A feature that we only have in Live loader. It is very useful for creating a mutation pipeline with Live loader, where we can reuse the blank nodes (and XIDs) with each new data ingestion without having to use the Upsert Block, for example.
It is very useful for those who maintain consistent control over the use of Blank nodes naming (and XIDs too, e.g. data coming from RDF triple stores).
Liveloader asks the user for a path where the XID-to-UID mapping is saved (I guess as posting lists, but they are Badger files). That way you can reuse the XIDs with each new data ingestion just by pointing to the location of the previously saved XIDs.
This functionality does not exist in the Bulk loader as my tests concluded.
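To illustrate the idea behind --xidmap described above, here is a minimal Python sketch. It is not how Liveloader implements it (Liveloader stores the mapping in Badger files); the `XidMap` class, the `xidmap.json` file name, and the sequential UID allocation are all hypothetical and only show why persisting the mapping lets a second run reuse the same UIDs.

```python
import json
import os
import tempfile


class XidMap:
    """Toy persistent mapping from XIDs (or blank-node labels) to UIDs.

    Hypothetical illustration only; Dgraph keeps this data in Badger.
    """

    def __init__(self, directory):
        os.makedirs(directory, exist_ok=True)
        self.path = os.path.join(directory, "xidmap.json")
        self.mapping = {}
        # Reload the mapping saved by a previous run, if there is one.
        if os.path.exists(self.path):
            with open(self.path) as f:
                self.mapping = json.load(f)

    def assign_uid(self, xid):
        """Return the UID already recorded for xid, or allocate the next one."""
        if xid not in self.mapping:
            self.mapping[xid] = len(self.mapping) + 1
        return self.mapping[xid]

    def save(self):
        with open(self.path, "w") as f:
            json.dump(self.mapping, f)


def demo():
    """Simulate two ingestion runs that share one mapping directory."""
    with tempfile.TemporaryDirectory() as d:
        first = XidMap(d)
        uid_a = first.assign_uid("_:MichelDiz")
        first.save()

        # Second run: same directory, so the same blank node keeps its UID.
        second = XidMap(d)
        uid_b = second.assign_uid("_:MichelDiz")
        uid_c = second.assign_uid("_:Other")
        return uid_a, uid_b, uid_c
```

In this sketch the second run resolves `_:MichelDiz` to the UID allocated in the first run, while a label it has not seen before gets a fresh one. That is the behaviour a pipeline relies on when it passes the same --xidmap path to every import.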
--store_xids:
A feature that exists only in Bulkloader. It takes the XID or blank node label and saves it on the same node as a property, using the predicate name "xid". e.g.:
<_:MichelDiz> <name> "Michel" .
It will be recorded as
{
  "data": {
    "q": [
      {
        "name": "Michel",
        "xid": "_:MichelDiz"
      }
    ]
  }
}
This feature is not compatible with --xidmap.
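As a rough sketch of the --store_xids effect described above (not Bulkloader's actual code), the Python function below adds one `xid` edge for every blank-node subject it sees. The function name `add_xid_edges` and the regular expression are my own assumptions, and it only looks at subjects, which is enough to reproduce the example above.

```python
import re

# Matches a blank-node subject written as _:label or <_:label>.
BLANK_SUBJECT = re.compile(r"^\s*<?(_:[^>\s]+)>?\s")


def add_xid_edges(nquads):
    """Return the input lines plus one `xid` edge per distinct blank-node subject.

    Hypothetical helper for illustration; it mimics the observed result of
    --store_xids, not the way the bulk loader produces it.
    """
    out = []
    seen = set()
    for line in nquads:
        out.append(line)
        match = BLANK_SUBJECT.match(line)
        if match and match.group(1) not in seen:
            label = match.group(1)
            seen.add(label)
            # Store the original label on the node under the "xid" predicate.
            out.append(f'<{label}> <xid> "{label}" .')
    return out
```

Given the triple from the example, the sketch emits an extra `<_:MichelDiz> <xid> "_:MichelDiz" .` line, which corresponds to the "xid" field in the query result. Subjects that already have a UID, such as `<0x321>`, are left alone.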
At first this functionality does not seem useful, but I'm sure it's related to the approach on external IDs: https://docs.dgraph.io/mutations/#external-ids
e.g.:
<0x321> <xid> "https://www.themoviedb.org/person/32-robin-wright" .
It can be useful in this case, and we can use Upsert Block with it. But it is only useful for small cases, not for those who need to ingest large amounts of data.
Thanks for the detailed info @MichelDiz!
So, the --store_xids option in bulkload is working after all. But the originally expected behaviour, having bulkload create an xid map the same way live loader does, is still open.
Should we close this issue and open a new one?
This issue is for Bulkloader and it is fine to keep it (it is already assigned internally); it now tracks supporting --xidmap in bulk loader. Maybe open a new one for Liveloader to support --store_xids there, if that is needed.