Sparklyr: Unable to understand spark_apply

Created on 25 Apr 2019 · 3 comments · Source: sparklyr/sparklyr

Hello guys,

I am developing an ML model in R and I am trying to use sparklyr to move it into production. Right now I am having problems with some feature engineering, and I do not know how to use the spark_apply() function.

I have this variable:

test_query <- sdf_sql(Connection, "Query")

Then I need to do this feature engineering:

preprocessing <- function(datasetml) {
  # Flag names that contain at least one digit
  datasetml$Name_has_numbers <- grepl("[[:digit:]]", datasetml$NAME_NORM_LONG)
  # Number of words in the name (stringi)
  datasetml$Num_words_name <- stri_count_words(datasetml$NAME_NORM_LONG)
  # Character counts of the first and second words (stringr)
  datasetml$Num_char_w1 <- nchar(word(datasetml$NAME_NORM_LONG, 1))
  datasetml$Num_char_w2 <- nchar(word(datasetml$NAME_NORM_LONG, 2))
  return(datasetml)
}

How can I apply this function with spark_apply to my sdf variable named test_query?

spark_version <- "2.3.0"
sc <- spark_connect(master = "yarn-client", version = spark_version)


All 3 comments

spark_apply() ships your data and the required R packages out to every executor, runs your code through R on the individual partitions, then puts the results back together in a Spark table. If you're only doing feature engineering then this adds far too much unnecessary overhead. Instead, you should use sparklyr::ft_* functions and/or dplyr verbs with Spark SQL functions inside them whenever possible, since that avoids transferring data and code between Spark and the R instances on the worker nodes.
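For example, here is a rough, untested sketch of the same features built with mutate() and Spark SQL string functions; it assumes NAME_NORM_LONG is a string column with space-separated words:

library(dplyr)

test_query %>%
  mutate(
    Name_has_numbers = regexp_extract(NAME_NORM_LONG, "[0-9]", 0) != "",  # any digit present?
    Num_words_name = size(split(NAME_NORM_LONG, " ")),                    # word count
    Num_char_w1 = nchar(substring_index(NAME_NORM_LONG, " ", 1))          # length of first word
  )

Here regexp_extract(), size(), split(), and substring_index() are Spark SQL functions that sparklyr passes through untouched, while nchar() is one of the R functions it translates. Something similar (nested substring_index() calls, or the tokenizer approach mentioned below) should get you the second-word length.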

That said, your spark_apply() code is failing because every external package you use in your function needs to be declared explicitly. Change word() to stringr::word(), etc.
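A minimal sketch of that route, assuming stringr and stringi are installed locally (spark_apply() distributes your local packages to the workers by default):

result <- spark_apply(
  test_query,
  function(datasetml) {
    # base R functions need no namespace
    datasetml$Name_has_numbers <- grepl("[[:digit:]]", datasetml$NAME_NORM_LONG)
    # non-base functions must be fully qualified
    datasetml$Num_words_name <- stringi::stri_count_words(datasetml$NAME_NORM_LONG)
    datasetml$Num_char_w1 <- nchar(stringr::word(datasetml$NAME_NORM_LONG, 1))
    datasetml$Num_char_w2 <- nchar(stringr::word(datasetml$NAME_NORM_LONG, 2))
    datasetml
  }
)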

EDIT: just saw your other reply. You can plug Spark SQL functions into dplyr calls basically as-is, e.g. my_tbl %>% mutate(day_col = dayofweek(date_col)), where dayofweek is the Spark SQL function.

@benmwhite And can I combine a Spark SQL function with an R function?

Only inside dplyr verbs like mutate(), summarise(), etc. If they receive a Spark table as input then they take the expressions inside them and translate the operation chain to Spark SQL under the hood. You can combine the Spark SQL functions with some R functions inside a dplyr verb if that R function is translatable to a SQL function.
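If you want to see what dplyr is doing, dplyr::show_query() prints the SQL generated for a Spark table; the expressions below are just illustrative:

test_query %>%
  mutate(name_lower = tolower(NAME_NORM_LONG),
         name_len = nchar(NAME_NORM_LONG)) %>%
  show_query()
# tolower() and nchar() should show up as LOWER() and LENGTH() in the generated SQL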

It looks like you're working with text data, so the relevant translatable R functions are mostly from the base package: tolower, toupper, trimws, nchar, substr. If you're trying to get word counts then you might be able to do it by combining sparklyr::ft_regex_tokenizer() with an operation using the Spark SQL function size(), but I haven't tested that out at all.
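An untested sketch of that idea: tokenize the name into an array column with ft_regex_tokenizer(), then count the tokens with the Spark SQL size() function inside mutate():

test_query %>%
  ft_regex_tokenizer(input_col = "NAME_NORM_LONG",
                     output_col = "name_words",
                     pattern = "\\s+") %>%
  mutate(Num_words_name = size(name_words))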

