DVC: ask users to opt in/out of analytics, and document the feature

Created on 19 Dec 2018 · 3 Comments · Source: iterative/dvc

I was a bit surprised to see that DVC recently added analytics tracking without asking users' permission, and without updating the docs: https://github.com/iterative/dvc/pull/1395

I bet most users would rather not send system and other information to some random endpoint in the cloud. This kind of feature (and more importantly the lack of transparency) makes it harder to trust tools like DVC.

I suggest asking users upfront during dvc init if they want to enable analytics. That adds transparency and would at least give users a chance to opt in/out.

All 3 comments

@mdscruggs, thank you for the great question!

Yes, we have indeed added telemetry to DVC via https://github.com/iterative/dvc/pull/1395. We are working on additional documentation and on changing dvc init, which will mention that we collect some anonymized usage stats, how to opt out, our motivation, the terms, etc. Overall, this will improve transparency. It was never our intention to hide this information (all analytics-related work was done via GitHub tickets); we just haven't had enough time to put everything together at once.

To give some overview (just to highlight certain important things, while we are preparing the document):

  1. All information is anonymized. We don't store any personal information that could potentially be used to track someone or be sold.
  2. The specific information we collect can always be checked here - https://github.com/iterative/dvc/blob/master/dvc/analytics.py . For now, it includes things like the OS version, the way DVC was installed, etc.
  3. There is a way to turn it off (opt out): dvc config core.analytics false disables it within a project; add --global to disable it per user, or --system to disable it for everyone.
  4. Motivation: it was a hard decision to enable analytics and make it opt-out by default. We carefully considered existing projects and discussions (homebrew, redash, django, etc.) and decided that it's worth doing at this stage because it is the only way we can keep developing the product in the right direction. We don't have any intention to monetize this information whatsoever, and most likely we'll make it opt-in in the future when we have a more stable product.
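
As a rough illustration of how the three levels in item 3 (project, --global, --system) could combine with an opt-out default, here is a minimal Python sketch. It is not DVC's actual implementation: the function names, the dictionary layout, and the precedence order (most specific level wins) are assumptions made for this example.

```python
# Hypothetical sketch, not DVC's code: resolve core.analytics across levels.
# Assumption: the most specific level that sets the option wins.
LEVELS = ("project", "global", "system")  # most specific first


def parse_bool(value):
    """Interpret a config value such as 'false' or 'true' as a boolean."""
    return str(value).strip().lower() in ("true", "1", "yes", "on")


def analytics_enabled(configs, default=True):
    """Return the effective setting; default=True models an opt-out feature."""
    for level in LEVELS:
        core = configs.get(level, {}).get("core", {})
        if "analytics" in core:
            return parse_bool(core["analytics"])
    return default


# Nothing configured anywhere: analytics stays on (opt-out by default).
no_config = analytics_enabled({})

# Equivalent of running: dvc config --global core.analytics false
user_opted_out = analytics_enabled({"global": {"core": {"analytics": "false"}}})
```

Passing default=False to the same function would model the opt-in behavior that the thread asks for.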

We'll keep this ticket open and close it once we have changed dvc init and provided additional documentation. Thanks again, @mdscruggs!

Thanks for the thorough and quick reply @shcheklein, and thank you and the team for the work on DVC. It's a really useful tool!

It's good to hear the intent behind the analytics feature. My main concerns as a potential user of any open-source software include trust (in the devs/project/code), security, API stability, and of course performance/features (roughly in that order). Security risks can be introduced merely by upgrading OSS packages...which is akin to what could happen here with the release of DVC's analytics feature. Appropriate transparency and documentation should come alongside such features, not afterwards.

I suggest that your docs also include clear specifics of how the data is used in addition to how it is not used, along with how the data is stored, replicated, shared, retained, accessed, etc. Ideally the data would be rapidly aggregated, such that individual events are not persisted...although I realize that may not be feasible. Even better would be to share the statistics you're gathering publicly, in the spirit of OSS (you have a nice website to put them on too!).

Thanks again. Please know my intent is to provide constructive feedback, and that DVC is quite a nice tool.

@mdscruggs sorry for the delay in responding.

> Appropriate transparency and documentation should come alongside such features, not afterwards.

Totally agree. We should have published that document first and started by modifying dvc init to mention it. It's our mistake, even though we had no intention of hiding anything (we had a publicly visible ticket on GitHub).

> Even better would be to share the statistics you're gathering publicly, in the spirit of OSS (you have a nice website to put them on too!).

This is our intention: to share an aggregated view of the data once we have a better understanding of how to do that.

> Please know my intent is to provide constructive feedback

Totally. We rely on this kind of feedback to grow a healthy and open community around DVC. Thanks again!

Please take a look at these changes and let me know what your thoughts are.
