Sparklyr: Unable to understand spark_apply

Created on 25 Apr 2019 · 3 comments · Source: sparklyr/sparklyr

Hello guys,

I am developing an ML model in R and I am trying to use sparklyr to move it into production. Right now I am having problems with some feature engineering, and I do not know how to use the spark_apply() function.

I have this variable:

test_query <- sdf_sql(Connection, "Query")

Then I need to do this feature engineering:

preprocessing <- function(datasetml) {
  # Flag names that contain at least one digit
  datasetml$Name_has_numbers <- grepl("[[:digit:]]", datasetml$NAME_NORM_LONG)
  # Number of words in the name (stringi)
  datasetml$Num_words_name <- stri_count_words(datasetml$NAME_NORM_LONG)
  # Character counts of the first and second words (stringr)
  datasetml$Num_char_w1 <- nchar(word(datasetml$NAME_NORM_LONG, 1))
  datasetml$Num_char_w2 <- nchar(word(datasetml$NAME_NORM_LONG, 2))
  return(datasetml)
}

How can I apply this function with spark_apply to my sdf variable named test_query?

spark_version <- "2.3.0"
sc <- spark_connect(master = "yarn-client", version = spark_version)


All 3 comments

spark_apply() ships your data and the required R packages out to every executor, runs your code through R on the individual partitions, then puts the results back together in a Spark table. If you're only doing feature engineering then this adds far too much unnecessary overhead. Instead, you should use sparklyr::ft_* functions and/or dplyr verbs with Spark SQL functions inside them whenever possible, since that avoids transferring data and code between Spark and the R instances on the worker nodes.
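For example, here is a rough, untested sketch of the same features built with mutate() and Spark SQL string functions; it assumes NAME_NORM_LONG is a string column with space-separated words:

library(dplyr)

test_query %>%
  mutate(
    Name_has_numbers = regexp_extract(NAME_NORM_LONG, "[0-9]", 0) != "",  # any digit present?
    Num_words_name = size(split(NAME_NORM_LONG, " ")),                    # word count
    Num_char_w1 = nchar(substring_index(NAME_NORM_LONG, " ", 1))          # length of first word
  )

Here regexp_extract(), size(), split(), and substring_index() are Spark SQL functions that sparklyr passes through untouched, while nchar() is one of the R functions it translates. Something similar (nested substring_index() calls, or the tokenizer approach mentioned below) should get you the second-word length.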

That said, your spark_apply() code is failing because every external package you use in your function needs to be declared explicitly. Change word() to stringr::word(), etc.
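A minimal sketch of that route, assuming stringr and stringi are installed locally (spark_apply() distributes your local packages to the workers by default):

result <- spark_apply(
  test_query,
  function(datasetml) {
    # base R functions need no namespace
    datasetml$Name_has_numbers <- grepl("[[:digit:]]", datasetml$NAME_NORM_LONG)
    # non-base functions must be fully qualified
    datasetml$Num_words_name <- stringi::stri_count_words(datasetml$NAME_NORM_LONG)
    datasetml$Num_char_w1 <- nchar(stringr::word(datasetml$NAME_NORM_LONG, 1))
    datasetml$Num_char_w2 <- nchar(stringr::word(datasetml$NAME_NORM_LONG, 2))
    datasetml
  }
)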

EDIT: just saw your other reply. You can plug Spark SQL functions into dplyr calls basically as-is, e.g. my_tbl %>% mutate(day_col = dayofweek(date_col)), where dayofweek is the Spark SQL function.

@benmwhite And can I combine a Spark SQL function with an R function?

Only inside dplyr verbs like mutate(), summarise(), etc. If they receive a Spark table as input then they take the expressions inside them and translate the operation chain to Spark SQL under the hood. You can combine the Spark SQL functions with some R functions inside a dplyr verb if that R function is translatable to a SQL function.
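If you want to see what dplyr is doing, dplyr::show_query() prints the SQL generated for a Spark table; the expressions below are just illustrative:

test_query %>%
  mutate(name_lower = tolower(NAME_NORM_LONG),
         name_len = nchar(NAME_NORM_LONG)) %>%
  show_query()
# tolower() and nchar() should show up as LOWER() and LENGTH() in the generated SQL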

It looks like you're working with text data, so the relevant translatable R functions are mostly from the base package: tolower, toupper, trimws, nchar, substr. If you're trying to get word counts then you might be able to do it by combining sparklyr::ft_regex_tokenizer() with an operation using the Spark SQL function size(), but I haven't tested that out at all.
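An untested sketch of that idea: tokenize the name into an array column with ft_regex_tokenizer(), then count the tokens with the Spark SQL size() function inside mutate():

test_query %>%
  ft_regex_tokenizer(input_col = "NAME_NORM_LONG",
                     output_col = "name_words",
                     pattern = "\\s+") %>%
  mutate(Num_words_name = size(name_words))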

