Dask: How to read directly from MongoDB or other NoSQL like Couchbase

Created on 3 Jun 2019 · 5 Comments · Source: dask/dask

How can I read directly from MongoDB or Couchbase?
It would be nice to be able to read directly from a NoSQL database with something like dd.read_mongo() or dd.read_couchbase().

dataframe io

All 5 comments

@martindurant, any thoughts on this?

You can do this now with dask.delayed. You need to come up with a reasonable way to partition your data, and then generate a query for each partition. Then, so long as you have a function which takes one of these queries and returns the corresponding list of objects (or dataframe), you are done. The code would look something like this:

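import dask
import pymongo

@dask.delayed
def q_to_list(q):
    # create the client inside the task, so each worker opens its own connection
    mongo = pymongo.MongoClient(...)
    # "mydb" and "mycoll" are placeholder names; assumes q is a filter document for find()
    return list(mongo["mydb"]["mycoll"].find(q))

# one query per partition; may be kwargs, whatever you need
queries = [main_query + partition_clause(i) for i in range(npartitions)]

# for lists of objects
import dask.bag as db
bag = db.from_delayed([q_to_list(q) for q in queries])

# for dataframes: the delayed function must return a pandas DataFrame rather than a list
import dask.dataframe as dd
df = dd.from_delayed([q_to_list(q) for q in queries])  # optionally provide meta= if you expect a known df structure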
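For the partitioning step, here is a minimal sketch, assuming the collection has an integer field you can range over. The helper name partition_filters and the field names user_id and status are made up for illustration; only the $gte and $lt operators come from MongoDB's query language.

```python
def partition_filters(base_filter, field, lo, hi, npartitions):
    """Split the half-open range [lo, hi) on `field` into one MongoDB filter per partition."""
    if npartitions < 1:
        raise ValueError("npartitions must be at least 1")
    # Integer boundaries; assumes `field` holds integers and lo <= hi
    bounds = [lo + (hi - lo) * i // npartitions for i in range(npartitions + 1)]
    return [
        # Combine the shared filter with this partition's range on `field`
        {**base_filter, field: {"$gte": start, "$lt": stop}}
        for start, stop in zip(bounds, bounds[1:])
    ]


# Hypothetical example: three partitions over user_id values 0..9
queries = partition_filters({"status": "active"}, "user_id", 0, 10, 3)
for q in queries:
    print(q)
```

Each resulting dict could then be handed to a delayed function like q_to_list. Evenly spaced ranges only give evenly sized partitions if the key is roughly uniformly distributed, so a skewed key may need boundaries chosen some other way.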

@martindurant
Yes, I was already aware of that approach from Stack Overflow.
I was expecting a direct method along the lines of dask's read_csv and read_sql.
It seems this is the only way to interact with other databases for now!

If you manage to make something along these lines which can be easily generalised, please submit it for inclusion - then there will be!

We don't see a lot of demand for this kind of functionality, but if you are interested in working on it I'm sure others would benefit! I'm going to close it in the meantime.
