Ecto: Repo.stream using join-preload produces duplicates

Created on 15 Feb 2018  路  9Comments  路  Source: elixir-ecto/ecto

Environment

  • Elixir version (elixir -v): 1.6.1 (compiled with OTP 20)
  • Database and version: postgresql 10.2
  • Ecto version (mix deps): ecto 2.2.8
  • Database adapter and version (mix deps): postgrex 0.13.5
  • Operating system: mac osx high sierra 10.13.3

Current behavior

When streaming data using the code below, we get duplicate and incomplete records back.
Query below works correctly using Repo.all.
Fiddling with max_count option in the Repo.stream() call influences the amount of duplicates returned.

query =
      from(
        i in Invite,
        distinct: i.id,
        join: s in assoc(i, :survey),
        left_join: p in assoc(i, :participation),
        left_join: m in assoc(i, :mails),
        left_join: pe in assoc(p, :events),
        left_join: me in assoc(m, :events),
        preload: [survey: s, participation: {p, events: pe}, mails: {m, events: me}]        
      )

    query
    |> Repo.stream()

Expected behavior

Return non-duplicated results, raise or warn if preloading is not possible when using Repo.stream().

Bug Starter

All 9 comments

As discussed on slack:

This happens because we assume one chunk of the stream has all the possible duplicate rows, but actually they may span across multiple chunks.

The simplest solution is to just refuse to do join-preloads on Repo.stream. If we choose this, we should warn on 2.2 and raise on 3.0.

Another solution which is more complex would be to buffer preloads across chunks. Unfortunately, this defeats the purpose of the stream, in a way.

@michalmuskala I agree, we can't do any preload on streams.

To summarize - whoever wants to tackle this issue should implement a warning if query passed to Repo.stream contains a preload and do a PR against the 2.2 branch and one where we raise in such case against the master branch (tracking Ecto 3.0).

Done in 6655a9ac7690dfc5e796a1519f4f0f7c97fcba35.

I thought about it a bit more and it seems regular preloads work just fine, it's just that we issue one query for preload per batch. The problem is with join-preloads because we can't deduplicate the results reliably.

@michalmuskala yes, I thought about the regular preloads but I think then you can either chunk and Repo.preload or rewrite using joins, depending on the problem.

@josevalim I'm still getting issues related to this.

    query =
      from p in CmsPage,
        where: p.scope == ^scope,
        order_by: [p.updated_at]

    query
    |> join(:inner, [p], e in assoc(p, :episode))
    |> join(:inner, [p, e], s in assoc(e, :show))
    |> join(:inner, [p, e, s], sp in assoc(s, :cms_page))
    |> preload([p, e, s, sp], episode: {e, show: s, show_page: sp})
    |> Repo.stream()

And the results I'm getting have Ecto.Association.NotLoaded for these fields. Should it be raising an exception instead?

I'm using ecto -> 3.0.7 and ecto_sql -> 3.0.5.

@kurtome ~Could you please provide an example in code or repo? Unfortunately I could not reproduce the bug with my test suite.~

UPDATED: Nevermind. 馃槄

Thanks!

Was this page helpful?
0 / 5 - 0 ratings