dvc: looks like Repo.active_stages() is not working

Created on 1 Aug 2019  路  14Comments  路  Source: iterative/dvc

It calls .pipelines(), which uses the whole graph instead of active one. That could be fixed, I would however say that removing distinction between active and all stages for purpose of push/pull/gs/status might be a good idea.

bug p1-important

Most helpful comment

I would say not pushing locked deps makes no sense. So am still for keeping
the current behavior - pushing everything.

All 14 comments

By active we mean not locked?

@pared yes.

Another issue is that we always build both active and full graph, but use only one. This is inefficient.

The implementation is also non-obvious. Now we implement Repo.active_stages() via Repo.pipelines(), instead of calling Repo.graph() directly, which makes everyone reading that code wonder whether .pipelines() is filtering stages on top of that. Consequently even if we change .piplelines() to use active graph it won't change .active_stages() since locked stages are added to active graph anyway, they are just not connected via their deps. They are still connected via their outs though.

So what active stage means is a mystery. Currently all stages are active, this is the reason why I opened them.

Judjing from how we use Repo.used_cache() and Repo.collect() active means whatever we want to push/pull/status by default. Also, dvc tag subcommands commands act on active stages, which looks like a bug.

So, it looks like that locked flag was misused as "do-not-push" flag on top of its original "do-not-repro" meaning and this is the root cause.

This is what I suggest:

  • keep current push/pull/status semantics,
  • drop "active stages" notion, since it is the same as all stages. Drop Repo.active_stages(), replace its calls with .stages(), drop active flag in Repo.used_cache(),
  • update Repo.graph() to only build either full or active graph (depending on flag:)

    • if we need "DVC-file is locked not reproducing" message on repro, then keep current active graph meaning

    • otherwise simply drop locked stages from it.

I agree with @Suor, when working on #2428 I caught myself few times on analysing what locked actually means. I don't see any reason why one would not like to push locked files. @efiop do you think it is reasonable to drop active_pipelines, and only make sure that locked stages are not reproduced?.
That would also mean removing gathering of G_active in Repo.graph()

@pared , there was already an issue about pushing locked files: https://github.com/iterative/dvc/issues/1382

I would say not pushing locked deps makes no sense. So am still for keeping
the current behavior - pushing everything.

Agree, @Suor

@Suor

keep current push/pull/status semantics,

:+1:

drop "active stages" notion, since it is the same as all stages. Drop Repo.active_stages(), replace its calls with .stages(), drop active flag in Repo.used_cache(),

It is not the same though, dependencies of locked stages should not be a part of active stages. We need to see where active_stages or its derivatives are used and whether or not things are broken for them and if we need to keep the current behavior for things other than push/pull/status -c.

update Repo.graph() to only build either full or active graph (depending on flag:)
if we need "DVC-file is locked not reproducing" message on repro, then keep current active graph meaning
otherwise simply drop locked stages from it.

building both at the same time is excessive, indeed. We could only build full graph and then for active graph take full one and drop dependencies of locked stages.

@efiop

We need to see where active_stages or its derivatives are used and whether or not things are broken for them and if we need to keep the current behavior for things other than push/pull/status -c.

active_stages has only been used here, and it seems to me that this elif was matched when used in

I don't think there is any reason why wouldn't we get used cache in those operations with locked files.

I am not sure whether making Repo.graph build only one graph is a way to go. Now we build two graphs, simultaneusly. That might be memory ineffective, but if we will build only one graph, we later need to iterate over all stages for second time and drop locked dependencies, which will prolong method execution.

There is no way to build from the beggining taking active flag into account, due to _check_cyclic_graph at the end of the function. For active=True case, test contained in test_repro.TestReproCyclicGraph will fail to detect circular dependency.

We can always build both graphs and return one depending on active flag. How about that?

@pared if you look at how .graph() is used you will see that only one graph is used at any time. So there is no reason to build both.

I can't see how TestReproCyclicGraph will fail if there is no locked stages there, but overall I see your point - if we don't build full graph we can't be sure there is no cycles there.

On building both and returning one, this is basically the same we do now, we only through that one a little but later - on the calling side, so we may just leave it as is then. Only remove .active_stages() and its calls, think about .graph() a bit more later.

Was this page helpful?
0 / 5 - 0 ratings

Related issues

siddygups picture siddygups  路  3Comments

anotherbugmaster picture anotherbugmaster  路  3Comments

TezRomacH picture TezRomacH  路  3Comments

gregfriedland picture gregfriedland  路  3Comments

nik123 picture nik123  路  3Comments