Seurat: Any reason not to run FindMarkers on multiple cores?

Created on 4 May 2018 · 25 comments · Source: satijalab/seurat

Hej!

I recently implemented the following code to speed up finding cluster markers in Seurat (using the BiocParallel package). It works quite nicely for me (the results I get with this code are identical to those from FindAllMarkers without parallelisation).
However, I'm still a bit sceptical... Is there any reason why one should not run FindMarkers on multiple cores?

Looking forward to hearing your opinion!

library(BiocParallel)

# Cluster identities are assumed to be 0-based (0 .. M-1)
M <- length(unique(mydataset@meta.data$ident1stLevel))
N <- M - 1

FindMarker.wrapper <- function(x) {
  FindMarkers(mydataset, ident.1 = x, only.pos = TRUE, min.pct = 0.25)
}

Markers <- bplapply(0:N, FindMarker.wrapper, BPPARAM = MulticoreParam(3))
Labels: Analysis Question

Most helpful comment

In Seurat3 try this:

library(future)
plan(strategy = "multicore", workers = 6)

All 25 comments

I'm not that familiar with the BiocParallel package but there's no inherent reason running FindMarkers on multiple cores would be problematic. Many of the differential expression functions lend themselves quite nicely to parallel solutions.

I have been doing something similar and have been considering submitting a pull request to parallelise FindAllMarkers() (probably not using BiocParallel to be consistent with the other multicore Seurat functions). Is this something you would be interested in?

I had also wondered whether it might be more efficient to do this in FindMarkers instead, to parallelise over genes rather than clusters. That way it could work even when only testing a single group. From memory, a quick look at the code suggested this would be difficult though, due to all the different DE methods. Any thoughts?

Thanks for your answers!
Good to hear that you don't see any hidden disadvantages in this solution.

I'm quite new to parallelised computing, so I'm afraid I cannot come up with a BiocParallel-independent solution at this point. But if you have any idea how to solve it without BiocParallel, feel free to go ahead and submit a pull request! I would be very interested to see your solution and learn.

Regarding your idea to parallelise over genes: if that were possible, it would be great. However, I have no idea how to realise it at this point.

Parallelizing over genes is probably the way to go here. It would allow for better scalability to larger clusters/distributed solutions but as you point out, it would be more work to implement as you would have to parallelize each DE method. We don't have this on the immediate timeline but if you'd like/have time to give it a shot, we could help out with the PR. Maybe starting with the default (or your favorite DE method) to see what kind of improvements we could get.
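To make the gene-level idea concrete, here is a rough sketch using base R's parallel package: genes are split into chunks and each worker runs a simple per-gene test. The toy expression matrix, the two-group design, and the use of wilcox.test are all invented for illustration; a real implementation would substitute one of Seurat's DE methods.

```r
library(parallel)

set.seed(1)
# Toy data: 20 genes x 10 cells, and two made-up groups of cells
expr <- matrix(rnorm(200), nrow = 20,
               dimnames = list(paste0("gene", 1:20), NULL))
group <- rep(c("A", "B"), each = 5)

# Split the gene names into one chunk per worker
chunks <- split(rownames(expr), cut(seq_len(nrow(expr)), 2, labels = FALSE))

cl <- makeCluster(2)                   # SOCK cluster: works on Windows too
clusterExport(cl, c("expr", "group"))  # ship the data to the workers
pvals <- parLapply(cl, chunks, function(genes) {
  vapply(genes, function(g) {
    wilcox.test(expr[g, group == "A"], expr[g, group == "B"])$p.value
  }, numeric(1))
})
stopCluster(cl)

pvals <- unlist(pvals)  # one p-value per gene
```

The chunking keeps the per-task overhead low; dispatching one task per gene would spend most of the time on communication rather than computation.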

Hi Andrew,

I am willing to help with the parallelization of DE in Seurat. I was aiming to submit a PR for NegBinomRegDETest(), as mclapply() often crashes on my laptop and I have to repeat the same test several times. Do you think using the foreach package would be the best option? It was already implemented in the JackStraw() and ScaleData() functions (see PRs #358 and #370 by @aymanm).

Best,
Leon

I am curious about mclapply() crashing on your laptop; are you on Windows OS?
I haven't experienced many issues with foreach, so I personally prefer to go with it. It also has an added advantage in ScaleData and JackStraw in that we can preserve the progress bars (if used with doSnow); I didn't find an easy way to get progress bars with other options.

I work on macOS. It was working fine previously, until at some point it started crashing too often. I had to change many of my functions implementing mclapply().

@leonfodoulian and @lazappi,

We're currently evaluating different parallelization packages in an attempt to figure out what would work best for Seurat. mclapply can safely be ruled out, as it uses forking, which means our Windows users would be unable to benefit from it. Unfortunately, this, as well as other technological restrictions, limits us to SOCK clusters using snow (the backend for most parallelization schemes in R). While this is great in that we can parallelize across all platforms, we end up with issues like #381, where messages (e.g. print, cat, warning, and message) are blocked (I don't know why this happens in R, it just does, and there's very little we can do about it). There is a way around it when initiating the cluster, but it means the user gets spammed with diagnostic messages that I haven't been able to get rid of. My solution for #381 is hacky and I wouldn't recommend anyone follow suit.

Because we're limited to clusters for parallelization, we have three main options: pbapply, parLapply, and foreach. From what I've seen, pbapply and parLapply require much less boilerplate than foreach to set up, and pbapply offers native progress bar support while parLapply does not. foreach, on the other hand, may require some large rewrites of code to get the different tests structured as functions rather than for-loops (I'm not actually sure; I haven't looked). As such, I'd recommend going with pbapply. I haven't used BiocParallel, so I don't know what that's like, but it actually looks promising; I'm going to keep playing with it, and maybe we'll opt for that.
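For reference, a minimal pbapply setup over a SOCK cluster looks like this; the squaring function is just a stand-in for a real DE test:

```r
library(pbapply)
library(parallel)

cl <- makeCluster(2)  # SOCK cluster: cross-platform

# pblapply() behaves like lapply() but draws a progress bar,
# and runs on the cluster when one is passed via `cl`
res <- pblapply(1:8, function(i) i^2, cl = cl)

stopCluster(cl)
```

The only boilerplate beyond plain lapply() is creating and stopping the cluster, which is what makes this route attractive.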

As Andrew said, this is not on our immediate timeline, but we're more than happy to help out with PRs. If anyone has any questions, please feel free to ask and we'll do our best to answer.

@mojaveazure,

Thank you for your answer. Regarding the issue you mentioned with messages getting blocked, as raised in one of my previous issues (#381), I also encountered such problems with mclapply(). I am not sure to what extent pblapply() and parLapply() are immune to that, as I have never used them before. However, I had previously submitted a PR to fix this for the RegressOutNBreg() function (see #383). Although the solution is quite ugly, I think it can do the job.

Best,
Leon

@leonfodoulian The parLapply and pbapply families are not immune to this issue, as it's due to how R handles parallelization (the same reason you can't modify global variables from parallel workers: R simply doesn't support it). Your hack is similar to mine, but it's still a hack. When creating the cluster with parallel::makeCluster, you can pass outfile = '' and have warnings and messages restored. However, you'll also get diagnostic information about the cluster, which gets in the way of the warnings and messages; I haven't found a way to get around that yet. Having played around with BiocParallel, it appears they've gotten around it. I have no idea how, and I am unable to replicate it on my own (they use parallel as a backend, so it's not as though they have completely different infrastructure).

After a bit of searching and experimenting, I realized that using doSnow as the backend for foreach is the culprit here. I tried doParallel instead, and now print() statements within foreach print to the console.
_NOTE:_ This only works if you run R from the terminal; it does not work in the RStudio console (which is again strange).
I can help out with experimenting with pbapply and will share my findings if I come up with anything positive.
My apologies for adding more issues to the Seurat team.

@aymanm Can you share your code for setting up doParallel as the backend for foreach? Depending on how it's set up, it'll use multicore instead of clusters, which won't work for Windows users.
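For comparison, registering doParallel with an explicit cluster object, rather than a core count, keeps it on SOCK clusters and therefore usable on Windows. A minimal sketch (the worker count and the toy loop body are illustrative):

```r
library(doParallel)  # attaches foreach as a dependency

cl <- parallel::makeCluster(2)  # SOCK cluster: works on Windows as well
registerDoParallel(cl)          # register the cluster as the foreach backend

res <- foreach(i = 1:4, .combine = c) %dopar% i^2

parallel::stopCluster(cl)
```

Calling registerDoParallel(num.cores) with a bare number instead uses forked workers on Unix-alikes, which is the multicore behaviour discussed above.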

hi @mojaveazure
You're right, I just used doParallel::registerDoParallel(num.cores), which probably won't work on Windows.
I haven't tested any other alternative backends yet, any recommendations?

May I tease you with the future ecosystem (disclaimer: I'm the author, but also a user of Seurat)? The future framework provides lots of different backends (and new ones can be added) for parallelization and distributed processing. It's cross-platform. The motto is:

The end user decides where and how to parallelize. The developer decides what to parallelize.

This means that you as a developer only need to concentrate on which parts of the algorithm should be parallelized; you don't really have to think about where (multicore/forks, PSOCK clusters, multiple machines, remote machines, HPC schedulers, ...).

The future.apply package (a future "frontend") implements all base-R apply functions such that they parallelize via futures. Here's a toy example,

mu <- 42  # A global variable
y0 <- lapply(1:10, FUN = function(x, each = 1L) rep(x + mu, each = each), each = 3L)

## Same via futures
library(future.apply)
## Default is sequential processing, i.e. plan(sequential)
y1 <- future_lapply(1:10, FUN = function(x, each = 1L) rep(x + mu, each = each), each = 3L)

## Parallel on the local machine (PSOCK on Windows, otherwise forked processes)
plan(multiprocess)
y2 <- future_lapply(1:10, FUN = function(x, each = 1L) rep(x + mu, each = each), each = 3L)

## Parallel on the local machine (callr backend - all platforms)
plan(future.callr::callr)
y3 <- future_lapply(1:10, FUN = function(x, each = 1L) rep(x + mu, each = each), each = 3L)

## Parallel on multiple machines (4 processes in total)
plan(cluster, workers = c("node1", "node2", "node2", "remote.server.org"))
y4 <- future_lapply(1:10, FUN = function(x, each = 1L) rep(x + mu, each = each), each = 3L)

## Parallel via an SGE scheduler in an HPC environment
plan(future.batchtools::batchtools_sge)
y5 <- future_lapply(1:10, FUN = function(x, each = 1L) rep(x + mu, each = each), each = 3L)

As you see, all you need to worry about as a developer is:

y <- future_lapply(1:10, FUN = function(x, each = 1L) rep(x + mu, each = each), each = 3L)

which is invariant to the backend. The user chooses how to parallelize via plan() depending on their compute resources.

Another advantage of the future framework is that you rarely have to worry about exporting global variables to workers or loading packages on workers; that is figured out automatically via static code inspection at run time. By adding the argument future.seed = TRUE, you get parallel-safe, statistically sound random number generation (RNG).
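As a small illustration of the RNG point: with future.seed = TRUE, the same set.seed() call gives identical results regardless of the number of workers or the backend (multisession is used here as a cross-platform choice; the toy rnorm() body is illustrative):

```r
library(future.apply)

plan(multisession, workers = 2)

# Two runs from the same seed produce identical draws,
# because future.seed = TRUE uses parallel-safe L'Ecuyer-CMRG streams
set.seed(42)
a <- unlist(future_lapply(1:4, function(i) rnorm(1), future.seed = TRUE))
set.seed(42)
b <- unlist(future_lapply(1:4, function(i) rnorm(1), future.seed = TRUE))

plan(sequential)  # shut down the workers
```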

You can find more information in the vignettes of the different future packages and on my blog, e.g. https://www.jottr.org/2018/06/23/future.apply_1.0.0/.

This looks very cool. We'd be happy to consider a PR. In general, we're very supportive of adding optionality for parallelization, but as you know from the discussion above, want to make sure this works across OS.

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

Ping

Hi @HenrikBengtsson, we responded to you by e-mail, but let us know if you did not receive it.

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

This in fact just got recent activity in the form of a duplicate (https://github.com/satijalab/seurat/issues/945). Is this discussion declared dead regardless, as indicated by the bot?

I for one feel strongly discouraged from contributing anything to a project that disregards community input because it takes time to get to it. Don't get me wrong: I understand that things take time and can only be worked on in a priority-based fashion. But simply closing issues and pull requests with valuable input just because nobody got around to taking care of them is not the right way to handle things, in my humble opinion.

Thanks for the feedback. We've replaced the stale bot with one that will only close issues marked with more-information-needed.

Re this thread, we've decided to implement parallelization via the future framework, which is in progress on the release/future branch.

In Seurat3 try this:

library(future)
plan(strategy = "multicore", workers = 6)
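A sketch of how this fits together with FindAllMarkers(), using Seurat's bundled pbmc_small toy object (the worker count of 2 is just for illustration; on Windows or inside RStudio, plan("multisession", ...) is the safer choice since multicore relies on forking):

```r
library(future)
library(Seurat)

plan(strategy = "multicore", workers = 2)

# In Seurat v3, FindAllMarkers() picks up the registered future plan
# automatically; no code changes beyond plan() are needed
markers <- FindAllMarkers(pbmc_small, only.pos = TRUE, min.pct = 0.25)
```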

It works very well, thank you for this update.
My question is: how can I convert "Markers" into a single table, with the cluster ID included?

I am not good at R; would you please help me out?


I figured it out with the bind_rows() function of the dplyr package.
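For anyone else combining per-cluster results into one table: bind_rows() can add the cluster ID automatically via its .id argument, which stores each list element's name in a new column. A minimal sketch with made-up marker tables (the genes and values are purely illustrative):

```r
library(dplyr)

# Hypothetical per-cluster marker tables, as returned by FindMarkers()
markers_list <- list(
  "0" = data.frame(gene = c("CD3D", "IL7R"),   avg_logFC = c(1.2, 0.9)),
  "1" = data.frame(gene = c("CD79A", "MS4A1"), avg_logFC = c(2.1, 1.8))
)

# Stack the tables; .id writes each list name into a "cluster" column
all_markers <- bind_rows(markers_list, .id = "cluster")
```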
