Elixir: Consider adding conveniences for pmap

Created on 19 Jul 2016 · 9 Comments · Source: elixir-lang/elixir

With GenStage.Flow more evolved, it is clear that this one pattern won't be covered by GenStage.Flow:

collection
|> Enum.map(fn x -> Task.async(fn -> do_something(x) end) end)
|> Enum.map(&Task.await(&1, 5000))

We should consider including such in the standard library. I can think of two approaches:

1. pmap: collection |> Enum.pmap(&do_something/1, 5000)
2. Task.async_many/await_many: collection |> Task.async_many(&do_something/1) |> Task.await_many(5000)

I am more inclined towards the second. It mirrors yield_many nicely and will also give us more control when starting tasks, reducing work. We will also be able to add Task.Supervisor.async_many and Task.Supervisor.async_no_link_many.
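
A minimal sketch of what the second approach could look like, written as a standalone module. The module name TaskMany and both function bodies are assumptions made for illustration; they are not the proposed implementation. The only library calls used are Task.async/1 and Task.await/2.

```elixir
defmodule TaskMany do
  # Hypothetical helpers modelled on the names proposed above.

  # Starts one linked task per element, as if Task.async/1 were called for each.
  def async_many(collection, fun) when is_function(fun, 1) do
    Enum.map(collection, fn item -> Task.async(fn -> fun.(item) end) end)
  end

  # Awaits the tasks in the order they were started, so the results keep
  # the order of the original collection. The timeout applies to each await.
  def await_many(tasks, timeout \\ 5000) do
    Enum.map(tasks, &Task.await(&1, timeout))
  end
end

results =
  1..5
  |> TaskMany.async_many(fn x -> x * x end)
  |> TaskMany.await_many(5000)

IO.inspect(results)
```

Because the tasks are started with Task.async/1, they are linked to the caller and must be awaited by the same process, which is the "same semantics as Task.async" point made for this option.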

PS: Note this is a discussion. If a PR is sent for now, there is no guarantee it will be merged.

Elixir Feature Advanced Discussion

josevalim

Most helpful comment

> Enum.pmap/2, would we provide something like Enum.peach/2

That's why I prefer the Task approach because it is more explicit in terms of semantics. Since it is async_something, it follows the same semantics as Task.async in terms of linking and in terms of calling Task.await_many being required.

> I give for granted that we're gonna provide Stream.pmap/2

At least this one does not make sense. Stream computes item by item, which means you can't leverage parallelism on Stream.pmap/2. If you try to do that, you will end up with something like GenStage.Flow. :)

Your reply makes me think the best way to go is 2 indeed.

josevalim on 19 Jul 2016

👍3
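
The "item by item" behaviour of Stream mentioned above can be observed with a small sketch. It records, in an Agent, the order in which two Stream.map stages see their input; the Agent and the record helper are only there for illustration.

```elixir
# Collect events in an Agent so the evaluation order can be inspected afterwards.
{:ok, log} = Agent.start_link(fn -> [] end)
record = fn event -> Agent.update(log, fn events -> [event | events] end) end

doubled_plus_one =
  1..3
  |> Stream.map(fn x ->
    record.({:stage_one, x})
    x * 2
  end)
  |> Stream.map(fn x ->
    record.({:stage_two, x})
    x + 1
  end)
  |> Enum.to_list()

# Events were prepended, so reverse them to get chronological order.
events = log |> Agent.get(& &1) |> Enum.reverse()
IO.inspect(events)
```

Each element passes through both stages before the next element is pulled, so at any moment only one element is in flight.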

All 9 comments

The problem I see with this approach is that it may actually be slower than a simple Enum.map/2 in many cases. We need really good documentation to tell people they need to make sure they provide enough work for each process, or an option to Enum.pmap/3 to chunk the data beforehand.

I'm just afraid people will simply change Enum.map/2 to Enum.pmap/2, see it's slower, and be very disappointed.

michalmuskala on 19 Jul 2016
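
A minimal sketch of the chunking idea, assuming a hypothetical ChunkedMap.pmap/4 helper (not an existing or proposed API) and Enum.chunk_every/2, which is available in Elixir 1.5 and later. Each task maps over a whole chunk, so every process gets a batch of work rather than a single element.

```elixir
defmodule ChunkedMap do
  # Hypothetical helper: one task per chunk instead of one task per element.
  def pmap(collection, fun, chunk_size, timeout \\ 5000) do
    collection
    |> Enum.chunk_every(chunk_size)
    |> Enum.map(fn chunk -> Task.async(fn -> Enum.map(chunk, fun) end) end)
    # Awaiting in start order and flattening keeps the original element order.
    |> Enum.flat_map(&Task.await(&1, timeout))
  end
end

input = Enum.to_list(1..10_000)

# :timer.tc/1 returns {elapsed_microseconds, result}.
{sequential_us, sequential} = :timer.tc(fn -> Enum.map(input, &(&1 * 2)) end)
{chunked_us, chunked} = :timer.tc(fn -> ChunkedMap.pmap(input, &(&1 * 2), 1_000) end)

IO.puts("sequential: #{sequential_us}us, chunked: #{chunked_us}us")
```

For cheap per-element work like the doubling used here, the sequential version may well be faster; measuring with real workloads is the only way to tell.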

Agreed with @michalmuskala, this is a "dangerous" feature to implement right because you really have to know what you're doing, and you have to know a bunch of stuff like how to handle errors in spawned processes (they're linked I guess?) and so on.

whatyouhide on 19 Jul 2016
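
On the linking question: tasks started with Task.async/1 are linked to the caller, so a crash in one of them also takes down the caller unless it traps exits. A sketch of one way to keep the caller alive, assuming a Task.Supervisor started ad hoc for the example, uses Task.Supervisor.async_nolink/2 together with Task.yield/2 and Task.shutdown/1:

```elixir
{:ok, supervisor} = Task.Supervisor.start_link()

# The last task raises ArithmeticError (division by zero) and crashes.
tasks =
  Enum.map([1, 2, 0], fn x ->
    Task.Supervisor.async_nolink(supervisor, fn -> div(10, x) end)
  end)

outcomes =
  Enum.map(tasks, fn task ->
    # Task.yield/2 returns nil on timeout; Task.shutdown/1 then stops the task.
    case Task.yield(task, 5000) || Task.shutdown(task) do
      {:ok, value} -> {:ok, value}
      {:exit, reason} -> {:error, reason}
      nil -> {:error, :timeout}
    end
  end)

IO.inspect(outcomes)
```

The crashed task is still reported by the logger, but the calling process survives and sees the failure as an ordinary value.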

I just want to point out that all of those arguments could be (and were) used against Task. "It could be used wrong" is definitely a given but I wouldn't like to limit everyone using it correctly because we are afraid people will use it without properly measuring or reading the docs.

josevalim on 19 Jul 2016

Sorry, I didn't mean to discourage this feature, I like the idea :) I meant that we should have a good balance of configurability (because a bunch of things could go wrong and we want to provide ways to not make them go wrong) and "do the right thing"™iness. Also, if we provide Enum.pmap/2, would we provide something like Enum.peach/2 as well? Asking because the "do these things in parallel but who cares about the result" pattern could be common. I give for granted that we're gonna provide Stream.pmap/2 as well, right? :) So many questions!

whatyouhide on 19 Jul 2016
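
A sketch of the "do these things in parallel but who cares about the result" pattern, assuming a hypothetical ParallelEach.peach/3 helper. It still awaits every task, so the caller knows all the work has finished, but it discards the return values.

```elixir
defmodule ParallelEach do
  # Hypothetical peach-style helper: runs fun for each element in its own
  # task, waits for all of them, and returns :ok.
  def peach(collection, fun, timeout \\ 5000) do
    collection
    |> Enum.map(fn item -> Task.async(fn -> fun.(item) end) end)
    |> Enum.each(&Task.await(&1, timeout))
  end
end

parent = self()

# The side effect here is a message sent back to the calling process.
peach_result = ParallelEach.peach([1, 2, 3], fn x -> send(parent, {:done, x}) end)

received =
  for _ <- 1..3 do
    receive do
      {:done, x} -> x
    after
      1000 -> :timeout
    end
  end

IO.inspect(Enum.sort(received))
```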

Ah, I see now with Stream.pmap/2, sorry, didn't think it through. Yes, from my perspective option 2 looks more formal/strict and appears to leave less room for mistakes :)

whatyouhide on 20 Jul 2016

I'll throw in a 👍 for option 2. I take it that it would retain the order of the collection unlike Task.yield_many. If so, would there be room for a function that doesn't care about retaining the order or is that use case covered by Task.yield_many?

Probably a case of YAGNI.

DevL on 20 Jul 2016

👍1
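
On ordering, a small sketch: awaiting tasks in the order they were started returns results in collection order, even when the tasks finish in a different order. The same input is also run through Task.yield_many/2, whose documentation in current Elixir releases states that results come back in the order of the tasks passed in; how it behaved at the time of this thread is not checked here. The start_tasks helper is only for illustration.

```elixir
# Each task sleeps for a different time, so they finish out of order.
start_tasks = fn ->
  Enum.map([300, 100, 200], fn ms ->
    Task.async(fn ->
      Process.sleep(ms)
      ms
    end)
  end)
end

awaited = start_tasks.() |> Enum.map(&Task.await(&1, 5000))

yielded =
  start_tasks.()
  |> Task.yield_many(5000)
  |> Enum.map(fn {_task, {:ok, value}} -> value end)

IO.inspect({awaited, yielded})
```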

Seems kind of a slippery slope; you know the first request is going to be for Task.chunk.
I think if you go into it knowing that managing a worker pool is out of scope, Task.async_many is fine. I would avoid Enum.pmap at all costs. Everyone should write that in the first week so they learn why it's not a great idea.

bbense on 21 Jul 2016

Please see #5367. We have decided to go with something that is based on GenStage.Flow.map and allows a streaming-bounded set of tasks to be computed. It is the more robust implementation of everything proposed. We have decided to call it Task.pmap because we will also add a Task.Supervisor.pmap variant, which spawns supervised tasks. The rationale for going this way came from observing the usage and benefits of Flow. It is quite different from the Enum.pmap everyone writes (since the number of tasks started is bounded). Closing this in favor of the PR.

josevalim on 28 Oct 2016
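
A sketch of the bounded approach using Task.async_stream/3, which ships in Elixir 1.4 and later and is used here as a stand-in: the thread itself only names Task.pmap, and whether that name survived into a release is not confirmed by anything above. The max_concurrency and timeout values are arbitrary choices for the example.

```elixir
results =
  1..10
  # At most 4 tasks run at any time; results are emitted in input order
  # as {:ok, value} tuples.
  |> Task.async_stream(fn x -> x * x end, max_concurrency: 4, timeout: 5000)
  |> Enum.map(fn {:ok, value} -> value end)

IO.inspect(results)
```

Unlike the hand-written Enum.map plus Task.async pattern, this never starts more than max_concurrency tasks at once, regardless of the size of the input.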
