Dvc: docstring format for api, possibly across code base?

Created on 17 Jan 2020  Â·  34Comments  Â·  Source: iterative/dvc

Extracted from https://github.com/iterative/dvc/pull/3130#pullrequestreview-343835585

We currently don't have a _desired style_ to write docstrings.

Should we come up with one? If so, which one?

discussion enhancement p3-nice-to-have

Most helpful comment

So we agree that we need docstrings in one shape or another, right? If so, please take a good look at the discussions in this ticket and notice how we all have different preferences. Most of us remember the constant fights we've had about the coding style a year ago, before we've started using black. We all come from different backgrounds and we all have different preferences, which will be a constant source of fights unless we agree on some style and just stick to it. "no-style" is a bad idea, because we will drown in fights as we did with code style and then we will still choose some style to put the end to our fights. So why not just skip ahead and pick something instead of wasting our energy on constant fights? That is why I'm admittedly pushing towards picking some style and sticking to it right away.

So from my research, there is 1 defacto standard tool for checking docstrings: pydocstyle and it supports three styles pep257, numpy and google. And I think we should choose one of those.

IMPO, numpy is aight but I haven't run into that docstring style that much and I suppose that is probably the reality. google is everywhere, so probably the most popular one in my experience. pep257 looks alright to me and, well, it is an active pep :) I would be fine with google or pep257, but I would pick pep257 because it doesn't have google-bias and is a de-jure active standard, that things like black mention so looks like the wisest choice.

All 34 comments

Most of the standards suggest """ for docstrings. I think we should go with the same.

https://stackoverflow.com/questions/3898572/what-is-the-standard-python-docstring-format

Thanks @bhavaniravi! I think thus much we can all agree on 🙂

The larger question is what text format to use INSIDE docstrings. Ref: https://stackoverflow.com/questions/3898572/what-is-the-standard-python-docstring-format

I don't think we should spend time formalizing this, we have very limited amount of user facing functions really. And we may find better way to spend our time than argue about docstring formatting.

@Suor I think you brought this topic, right? :) Are we good to close it (discussion around formatting) then and let Jorge decide this, for example?

I agree it's only important at the moment to have good full code doc for the API functions and I'm happy to just pick one format but would prefer to get some feedback from the @iterative/engineering team. In fact I wanted to summarize the options here (extracted from https://github.com/iterative/dvc/pull/3130#discussion_r368192562 +):

is there a standard way of doing markup that is understood by the tools that generate a view for this (like F1 in PyCharm, quick help in VS Code, docs generators, etc) ?

Among doc generators, Sphinx is by far the most popular, with pdoc a remote 2nd place, it seems. Sphinx uses reStructuredText format. pdoc is interesting because it supports Markdown and other nicer-looking formats worth checking out actually (from Numpy and Google).

As for IDEs, VSCode intellisense doesn't interpret any format in the docstrings it seems. It mostly focuses on building and inspecting the Python code by itself. From what I can see here, PyCharm auto completion works similarly.

So easy to read docstrings e.g. numpydoc or Google style may be best if we expect IDEs code completion will be the main way users reach our docstrings. And especially since we're writing the API documentation manually anyway (not generating it).


p.s. An interesting VSCode extension to help write docstrings in any of the above formats is https://marketplace.visualstudio.com/items?itemName=njpwerner.autodocstring, BTW.

OK... I'm going to suggest we adopt the Google style for modules (files) and functions. Reasons:

  • We don't need to produce automatic docs (i.e.reST/Sphinx) since we're manually writing them in the dvc.org repo – still, pdoc does support generating docs from Google style docstrings just in case.
  • Seems like the easiest one to read – uses plain language, not many symbols. This is good for IDE completion which I think will be the main way people interact with our docstrings.
    Its reference is the shortest and simplest one, but covers all the basics: module license; fn descriptions, arguments, return/yield, exceptions raised.
  • Doesn't use back quotes `, so @Suor will be happy 🙂

All in favor? ✋

I am in favor of no style, it will only slow down reviews with no real gain.

I'm in favour of an easy-to-use style so it doesn't slow down reviews :)

Also in favor of not slowing down reviews for doc strings.

I'm in favor of some simple consistent style for the external API functions (better to be more or less consistent internally, but would never block any review for this). If we have a ticket like this #2316 and we deal with this, it's better to agree what are some basic expectations and how this should looks like. I'm fine with Google style.

I am in favor of no style

I think it's less about exact styles and more about completeness of docstings. I.e. Google format specifies which sections to include in the docstring (basically description, args, result/yield, throws I think). Yes, a style is attached to that but it's pretty simple... And consistency is good!

not slowing down reviews for doc strings ... some simple consistent style for the external API functions (better to be more or less consistent internally, but would never block any review for this

Yes guys, this issue is mostly about the API code. I think it's totally up to the core engineering team to decide about docstrings in the rest of the code.


I'm confused about the voting results so far: 1 for no format, 2 for Google, 2 invalid haha – no consensus yet :(

What I don't like from Google style is that it "annotates" the arguments with types.
I would prefer to have some proper type hints instead.

I'm afraid that if we start documenting everything, we will end up including useless comments just for the sake of, like:

def do_stuff(x, y):
    """A method that do_stuff given x and y
    Args: 
         x (str): This is an X
         y (str): This is a Y.
     """

Ideally, the method name should be clear enough to understand what it does, so no need to put that on the description, same with argument names.

Things that I think are important to document:

Side effects
https://github.com/iterative/dvc/blob/ec2c8ad8bc6236e9c2b54123c6ff63a823cb37e9/dvc/analytics.py#L67-L81

Explain convoluted implementations / What & Why?
https://github.com/iterative/dvc/blob/ec2c8ad8bc6236e9c2b54123c6ff63a823cb37e9/dvc/analytics.py#L26-L49

Give examples when code is hard to read
https://github.com/iterative/dvc/blob/ec2c8ad8bc6236e9c2b54123c6ff63a823cb37e9/dvc/repo/__init__.py#L271-L298

_DISCLAIMER: I wrote all of those docstrings. I'm completely bias._ :see_no_evil:

I'm afraid that if we start documenting everything, we will end up including useless comments just for the sake of

I fully agree, except if we start doing intelligent parsing of docstrings in order to auto-gen docs in such a way that requires having to (annoyingly) document everything. I prefer to not document the obvious. And any intelligent parsing should look at the func signature as well as docstrings to pick up args.

@mroutis @casperdcl 💯 about the "code must be good enough to read without comments in the first place", no doubts about this! But for this discussion, let's focus on external APIs _only_. (dvc.api.* for now).

My bias and experience here. I really like when I can press a hotkey, get to the source code of the API and see some documentation there (~ short version of what you can get online, but enough in most cases to understand everything, probably no examples, but definitely edge cases, what do we expect from the params, etc). It tremendously simplifies the experience in my case.

So, back to the question. Do we want to have a bit more then just an obvious "This is a get method, it gets something" kind of docstrings at all (again, for external, user-facing API)? If not, should we just delete existing one-sentence comments (that do not provide any value whatsoever)?

Btw, a solution might be to put just a link the docs? (similar to what we do in CLI).

enough in most cases to understand everything, probably no examples, but definitely edge cases, what do we expect from the params, etc)

Yes, definitely

Do we want to have a bit more then just an obvious "This is a get method, it gets something"

Only if needed to describe a complex operation

should we just delete existing one-sentence comments (that do not provide any value whatsoever)?

Yes if no genuine value provided

Btw, a solution might be to put just a link the docs? (similar to what we do in CLI).

Yes, though I'd recommend adding a travis job which checks for dead links.


Overall in all cases I'd always recommend the original dev does the sensible thing, whether it's comment in prose, in bullet points, in code examples, in links - whatever they think will be most useful for people who may read it. There may not be a one-style-fits-all-functions approach.

Only if needed to describe a complex operation

💯 ! In our case all operations (dvc.api) are not simple, neither self-descriptive imo.

There may not be a one-style-fits-all-functions approach.

If we though have some external APIs we want to describe args, exceptions, some edge cases for - it's better to use the same style, right? (And that's all this discussion is about.) And I believe we have those (I would find docstrings def useful for myself - again, I have a bias - I use IDEs and navigate to libraries all the time and prefer it over online docs).

I also prefer inline to online. I grudgingly accept online if well done. I don't like any duplication.

I feel like online should perhaps have prosaic expansion on examples, while inline should be as succinct as possible.

Btw, a solution might be to put just a link the docs? (similar to what we do in CLI).

I don't see this being of much value, you can just open he website and find the function easily on your own.

I'd recommend adding a travis job which checks for dead links.

Unrelated @casperdcl but if you have any experience on this, please participate in iterative/dvc.org/issues/652


OK so still no decision yet on which docstring format 😢 I may have to just start using Google style for now to get the inline doc going but we can revisit later...

@jorgeorpinel are you sure about #652? That seems even more unrelated than I was expecting.

Haha sorry @casperdcl, I meant iterative/dvc.org/issues/652!

I think it's less about exact styles and more about completeness of docstings. I.e. Google format specifies which sections to include in the docstring (basically description, args, result/yield, throws I think). Yes, a style is attached to that but it's pretty simple... And consistency is good!

I am against going for completeness too. Google format is a good example of how bloat is created by doing so. The same reason to avoid consistency if that causes bloat. Python stdlib lives happily without that.

@shcheklein the doc-string should describe what function is doing, so that people wouldn't need to guess by name. Params should only be described if that is non-obvious, formal descriptions is bloat and will only stop people from reading it or at least slow them down. Any edge cases should never make it to doc-strings, unless they represent some common use case at the same time.

Google format is a good example of how bloat is created by doing so.

don't see how does it create bloat. Any docs are valuable for me (list of exceptions to expect if it can be provided, types, brief param descriptions, etc). I do use them extensively if they exist. Agreed that we can omit parameters if they are obvious. On the other hand how is it different from CLI - why do we include brief description to seemingly obvious options?

Python stdlib lives happily without that.

Not sure I understand why should we rely on this. We can check other very popular libraries? Also, they have extensive docs for a lot of stuff even inside. Have you seen any guidelines they use by chance? Do they have any?

Any edge cases should never make it to doc-strings, unless they represent some common use case at the same time.

check stdlib, I think they have tons of notes, long descriptions like os.walk, etc. Hard to judge if they are rare edge cases or common situations. I would expect that everything that starts with Note is kind-a edge case?


To reiterate. I don't care that much about philosophy, style guides, stdlib, google, etc. As I said my bias is very simple - I prefer offline to online and I do read and use public APIs docs via IDEs extensively. Plus it's better to be consistent. That's why I voted the way I voted.

os.walk() has complicated interface, thus a long doc string. It also shows the value of examples and no value of formal style guide.

Simple functions, like ones from os.path have very simple doc-strings, usually one-liners. Any non-obvious things are mentioned briefly, see os.path.exists() for example.

Okay, to summarize.

Looks no one wants to introduce a heavy formats like google's, etc (with sections, and whatnot) and more in favor of stdlib's approach, namely: single or multi-line block of text which might (and probably should) include param names, exceptions, etc, and describes what's happening in some reasonable way.

I'm totally fine with this as well. If we all agree on this, @jorgeorpinel @efiop let's close this?

For the issue of improving (non-) existing docstrings we should create a separate issue? which I think is way more important to be honest than discussing a specific format in the first place.

@casperdcl summarized it well enough as well:

Overall in all cases I'd always recommend the original dev does the sensible thing, whether it's comment in prose, in bullet points, in code examples, in links - whatever they think will be most useful for people who may read it. There may not be a one-style-fits-all-functions approach.

in favor of stdlib's approach, namely: single or multi-line block of text which might (and probably should) include param names, exceptions, etc, and describes what's happening in some reasonable way.

It sounds pretty vague to me. Is the policy "please write some sort of docstrings when you think its reasonable"? We would be back at square one but I can't really have a strong opinion here so up to you guys 🙂

p.s. just a question: does this mean it's not allowed to use a full format? Or is it totally fee for everyone to chose their favorite docstring flavor? Bc I've already been using Google's a little bit 😋

@jorgeorpinel I would avoid mixing different styles - e.g. Google and stdlib. The way I understand stdlib - you just write a "description" that is valuable to the end-user of what's going on, using param names as terms in it.

please write some sort of docstrings when you think its reasonable

we go from a format discussion to the usefulness of them, let's move it to another topic please?

I think we should write docstrings in some sane format. Google's style is a good choice too, but maybe there is something better. It will allow editors to pick them up easily, improving the dev experience.

Sloppy docstrings like in stdlib are as good as nothing. If we start bothering with these, we might as well stick to some good format that will be actually useful.

I don't like docstrings myself, but python is too high level to live without them. Even the exceptions introduce some non-obvious flow control, that you can't really tell from looking at one function's body. So I think we need to be good developers and use docstrings to document what needs to be documented.

I just want to add a link to this discussion about exceptions in stdlib docstrings: https://github.com/iterative/dvc/pull/3352#discussion_r384206743

The summary is that they seem to just be mentioned (not even with specific type) some times (when it seems relevant). So, a pretty flexible common-sense approach as well.

@jorgeorpinel stdlib is quite old, so I would totally believe that they don't use proper docstrings because back in a day those didn't exist for python and nowadays no one wants to go through the whole codebase to consistently use new format everywhere.

Yeah, I didn't pick that style. I voted for Google style originally, but others weren't convinced and decided to use a super simple "no style" docstrings inspired on stdlib (at least for now), which is why I'm focusing on stdlib for docstring related matters. (Please refer to discussion above.)

Recently I've noticed we kind of already use Google style docstrings throughout the cod base, so I'm a bit confused again... But will continue to assume the resolution so far is in https://github.com/iterative/dvc/issues/3176#issuecomment-579872748 ^ Totally up to you guys.

I don't really see a point of the discussions here, feels like bikeshedding.
We could just start with documenting dvc.api first without any formats. At least, it's a good start, and helpful for people doing:

>> help(dvc.api.open)

Then, we could start docstring format for api later on if it provides more benefits (and, it's not hard to change formats for just dvc.api). I don't think it's important to discuss about formats for internals, yet (the benefits of having a docstring vs docstring in a format for internals is small).

I am still against any format, including googles. It only adds bloat. You should only mention things important for a specific function, not a predefined list in a predefined format, half of which people will need to skim through.

I don't care if some IDEs would be happy parsing it, I prefer it to be human readable instead.

The one thing I've seen that would actually be pretty useful is to use back quotes around param names when explaining them in the docstring, since we're avoiding a special Args section. This was discussed here early on and it was mostly Alex against it. We ended up agreeing not to use them but now that I've been writing them I kind of miss them. This is because param names or other literals tend to be common words like path, so using backquotes around them makes it obvious that they're literals and not just a word in a sentence.

def func(path, x, y)
```Calculates something with x and y argument,
   based on a file found in the given path arg.```

vs.

def func(path, x, y)
```Calculates something with `x` and `y`, based on the file in `path`.```

It's just a bit faster and more obvious using `. Maybe we can even allow this partially? (Not required but allowed when needed)

So we agree that we need docstrings in one shape or another, right? If so, please take a good look at the discussions in this ticket and notice how we all have different preferences. Most of us remember the constant fights we've had about the coding style a year ago, before we've started using black. We all come from different backgrounds and we all have different preferences, which will be a constant source of fights unless we agree on some style and just stick to it. "no-style" is a bad idea, because we will drown in fights as we did with code style and then we will still choose some style to put the end to our fights. So why not just skip ahead and pick something instead of wasting our energy on constant fights? That is why I'm admittedly pushing towards picking some style and sticking to it right away.

So from my research, there is 1 defacto standard tool for checking docstrings: pydocstyle and it supports three styles pep257, numpy and google. And I think we should choose one of those.

IMPO, numpy is aight but I haven't run into that docstring style that much and I suppose that is probably the reality. google is everywhere, so probably the most popular one in my experience. pep257 looks alright to me and, well, it is an active pep :) I would be fine with google or pep257, but I would pick pep257 because it doesn't have google-bias and is a de-jure active standard, that things like black mention so looks like the wisest choice.

Was this page helpful?
0 / 5 - 0 ratings

Related issues

dmpetrov picture dmpetrov  Â·  35Comments

dmpetrov picture dmpetrov  Â·  64Comments

kskyten picture kskyten  Â·  44Comments

drorata picture drorata  Â·  46Comments

pared picture pared  Â·  73Comments