Python 3.6, Ray 0.8.7
What's the best way to query a large Pandas DataFrame with Ray?
import ray

ray.init()

# df = large Pandas DataFrame

@ray.remote
def parser(df, idx):
    subdf = df.loc[idx]  # in this case, df.index.is_unique == False
    # Perform several calculations on subdf, return
    return subdf.shape  # for example

# This takes a long time when I pass df
futures = [parser.remote(df, idx) for idx in list_of_idxs]
outputs = [ray.get(f) for f in futures]
I notice that if I pass the DataFrame as an argument, building the futures list takes a very long time (I assume because it has to pickle it).
If instead I query a global DataFrame, the futures list no longer takes a long time, but oddly if I change df and rerun, it still uses the old df.
I wrote a long Stack Overflow answer with more details, but the tl;dr is to call df_ref = ray.put(df) once and then call parser.remote(df_ref, idx), so every task receives the reference rather than a freshly serialized copy of the DataFrame.
In the case of dataframes specifically, you may also consider using modin.
Here's the Stack Overflow post for reference: https://stackoverflow.com/a/63769670/1906826
@wuisawesome Thanks so much, that worked great!
Quick question, if you have the time: which feature(s) from modin would improve this? I don't yet understand how to use it to improve this task.
The modin dataframe you interact with is already built with this idea in mind, so you could just call parser.remote(df, idx) with a modin dataframe and it would still be pretty fast. It would let you use regular pandas syntax though.
For example, you could've just done outputs = df[list_of_idxs].apply(parser, axis=1) without annotating it as a remote function. If your task is this simple, it may not be worth adding the additional dependency though.