Dvc: Use pathlib where possible

Created on 1 Dec 2017  路  9Comments  路  Source: iterative/dvc

We're currently using pathlib only in command/init, but it might be also very useful to use it elsewhere. I.e. by using it in our path/path.py or just using it raw everywhere instead of string paths.

enhancement

Most helpful comment

I'm in for at least abstracting path arithmetics out. I think we can start doing this gradually as we are closing issues to mitigate the migration complexity @efiop has mentioned. From what I see now - it's too error prone to do this manually in many places.

As for hdfs/http, etc. I think @mroutis is right and it's probably better call it Uri. My intuition is that this class will have a lot of DVC specific methods and even if we use internally urlpath or something else, I would go with our own class for this.

All 9 comments

A few month ago I've done some work to unify all the path operation by removing pathlib :)

My opinion - any of these libraries is fine. But we should use only one.

Yeah, actually it is really bad in our case. I tried it while doing RFC and really hated it, our Path() class is much better. Closing.

@efiop , @dmpetrov what do you think about reopening this one?

We don't have a Path class right now, but it will be useful to have more than just os.path, as it is not that easy to work between windows and posix.

From the pathlib documentation:

This module offers classes representing filesystem paths with semantics appropriate for different operating systems

I think Path objects are more useful than strings for filesystem manipulation, but before working on a PR, I would like to hear more about why do you moved away from pathlib.

I've tried it and what I didn't liked was overriding the forward slash to do joining (too disruptive for me :sweat_smile:), but it is not mandatory, and you can use a more classic.join

@pared , @ei-grad , your input would be appreciated, as it is something that would affect the way we interact with the code

@mroutis This would not be a simple change and I don't think it is worth bothering with at this point. Migrating to Pathlib is very low priority.

All right, @efiop , closing it again.

Agree with @efiop - it is not easy to migrate and I don't see any good reason to use specifically pathlib.

However, I disagree about the Path class. I think we lost an important abstraction level when we removed Path class. It was a great abstraction for converting absolute path, relative path as well as dvc paths (relative to project root). As a result we have custom code spread across the project for all of these (simple) transformations. This luck of a common abstraction affects the project in many places. Just a couple examples: #1527 #1588.

I'd suggest to re-create our Path class, generalize it for http\hdfs and other protocols. Let's create an refactoring issue for introducing a new Path class. What do you think @efiop and @mroutis @pared @shcheklein?

@dmpetrov what improvements would our custom Path class bring compared to using pathlib's class? Sounds interesting.

Maybe we can write something like https://github.com/chrono-meter/urlpath for DVC's supported protocols

I'm in for at least abstracting path arithmetics out. I think we can start doing this gradually as we are closing issues to mitigate the migration complexity @efiop has mentioned. From what I see now - it's too error prone to do this manually in many places.

As for hdfs/http, etc. I think @mroutis is right and it's probably better call it Uri. My intuition is that this class will have a lot of DVC specific methods and even if we use internally urlpath or something else, I would go with our own class for this.

Was this page helpful?
0 / 5 - 0 ratings