A couple of places in the `String` docs state that an operation is `O(n)`, usually when discussing a method that needs to shift elements over. However, this is not helpful to users who do not know O-notation, and `n` is generally either misleading or undefined.
`String::insert_str`: The operation is really `O(n-idx+k)`, where `n` is the length of `self`, and `k` is the length of `string`. In fact, that's not even true if the `String` needs to be resized. If we wanted to cover that, we'd need to include the "amortized" qualifier in there. `O(n)` is insufficient in part because if `k` is large and `n` is small, `k` is much more important. We probably want to ditch the O-notation here entirely, and just explain that it needs to shift everything after `idx` over and copy in the characters from `string`, and that this operation may be slow because of it.
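To make the shift-and-copy cost concrete, here is a minimal sketch of the work `insert_str` has to do. This is not the std implementation (`insert_str_manual` is a hypothetical helper; std operates on the raw byte buffer directly), but the byte traffic it models is the same: everything after `idx` moves, and all of `string` is copied in, so the cost is proportional to `(n - idx) + k`.

```rust
// Hypothetical sketch (NOT the std implementation) of the work that
// `String::insert_str` performs.
fn insert_str_manual(s: &mut String, idx: usize, other: &str) {
    assert!(s.is_char_boundary(idx));
    let tail = s.split_off(idx); // copies the n - idx tail bytes out
    s.push_str(other);           // copies the k bytes of `other`
    s.push_str(&tail);           // copies the tail back into place
}

fn main() {
    let mut s = String::from("hello world");
    insert_str_manual(&mut s, 5, ",");
    assert_eq!(s, "hello, world");
}
```

Note that when `k` is much larger than `n`, the `push_str(other)` copy dominates, which is exactly the case an unqualified `O(n)` hides.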
`String::insert` and `String::remove`: These are less egregious since `O(n)` is correct. However, the smaller bound of `O(n-idx)` can be given, and the difference might be significant. Here too, it'd be better to just get rid of the O-notation altogether and explain what happens instead.
I think `n` basically serves the purpose of indicating whether the function is linear time; in big-O notation it doesn't really matter much, since `n` and `k` tend to be similar, and as long as the bound is their sum they can be approximated.
Also, the C++ docs use big-O notation too, so I think it's the standard way of representing complexity. It's true that what `n` is should be clarified, and the C++ docs are quite clear about that.
I'd much rather we explain or link to an explanation of big-O notation rather than remove it entirely. For example: http://bigocheatsheet.com/ might be a good resource to link to.
> In fact, that's not even true if the `String` needs to be resized. If we wanted to cover that, we'd need to include the "amortized" qualifier in there.
If you consider allocations to be `O(allocated size)` (which is "true enough in practice"), then all these methods are `O(n+k)`.
I'm not sure that documenting that the methods are `O(n-idx+k)` is that important - it's still `O(n+k)` (by definition), and it's not that common that `n-idx << n`.
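On the resizing point: when capacity is reserved up front, `insert_str` pays only for the shift and copy discussed above, with no reallocation. A small sketch (the capacity value and pointer check are illustrative assumptions, not anything the docs promise to spell out):

```rust
fn main() {
    // Reserve more capacity than the final length will need.
    let mut s = String::with_capacity(64);
    s.push_str("hello world");
    let ptr_before = s.as_ptr();

    s.insert_str(5, ","); // capacity suffices, so no reallocation occurs

    assert_eq!(s, "hello, world");
    assert_eq!(ptr_before, s.as_ptr()); // the buffer was not moved
}
```

Without the reserved capacity, a growth step can trigger a reallocation and a full copy of the buffer, which is why the "amortized" qualifier would be needed for a fully honest bound.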
When expressing big-O notation, it is typical to exclude the constant. You would never write `O(2n)` or `O(n - i)` because the constant is irrelevant: the purpose of big-O notation is to tell you how well an algorithm _scales with the size of the data_, i.e. whether it is constant, linear, polynomial, etc. The reason we ignore the constant is that, when `n` becomes sufficiently large, for example, `O(n^2)` will at some point _always_ become slower than either `O(n)` or `O(100n)`. It isn't that the constant doesn't affect the speed of the function; big-O notation simply isn't concerned with the constant portion.
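The crossover claim is easy to check numerically. This tiny sketch (purely illustrative, not from the thread) finds the first `n` at which quadratic growth overtakes a linear function with a large constant factor:

```rust
fn main() {
    // Find the first n where n^2 exceeds 100n: past this point the
    // quadratic algorithm is always slower, regardless of constants.
    let crossover = (1u64..).find(|&n| n * n > 100 * n).unwrap();
    assert_eq!(crossover, 101); // n^2 first exceeds 100n at n = 101
}
```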
I'm also not so sure about putting a link to an explanation of big-O. Would it go on every docblock that references big-O notation? That seems like a lot of extra links to maintain in the documentation. I think having a link _somewhere_ is a fine idea, but I'm not sure where the best place would be to put that.
Big-O notation generally is used to specify asymptotic behavior, usually specifying the limiting exponential bound for computation growth with number of items (e.g., for sorting or searching algorithms) or with item size (e.g., for bignum computations). For big-O use, additive and multiplicative constant factors are not relevant.
It is always possible to specify a formula for computation growth rather than simply a big-O asymptotic-behavior bound. That is the choice for the person writing the documentation. IMO detailed formulae are of interest when reworking the compute-bound choke points in an algorithm, but often don't matter for most code. Similarly, asymptotic behavior may be relevant or not, depending on both the class of algorithm and the range of anticipated potential uses. Different considerations apply when creating personal-use tools such as a contact list manager vs. corporate tools such as an airline reservation manager.
I would prefer that the subject of big-O be included in the basic Rust documentation, with a link to an existing explanatory resource such as http://bigocheatsheet.com/. IMO most programmers who want or need to code in Rust (vs. e.g., Go) will have encountered big-O notation before, so won't need the reference.
Thank you to everyone explaining big-O notation. That was not the intention of this post. The observation is that the current bounds are misleading, because there are other non-constant factors (e.g., the length of the string you are splicing in) which may end up dominating the cost. I disagree that the current bounds are useful, precisely because they do not capture the actual cost. I think this can instead be better explained with (relatively few) words.
@rust-lang/libs, what do you feel about the accuracy of this documentation?
Some personal thoughts:
I agree that Big-O is an appropriate thing to use to document the complexity of algorithms. It is the ~universally agreed upon baseline to discuss algorithmic complexity. Constant factor overhead is a thing that you have to be aware of when using Big-O in any context, not just operations on Rust string types.
Why are y'all talking about constant factors? The issue text doesn't mention any, all the examples given are about other variables.
Oh yeah sorry I misread.
I agree that Big-O is widely used when describing the complexity of algorithms. I am not trying to detract from the general value of Big-O, nor to say that Big-O is bad. However, I do not believe that the documentation for a programming language's standard library should generally be approached the same way as descriptions of general computer science algorithms. It's true that `String::insert_str` is an algorithm in the sense that every sequence of operations is an algorithm, but I think it's misleading here to just say `O(n)` and be done with it. Sure, there are some functions that really are algorithms in the more formal sense, and that have amortized bounds (like operations on `BTreeMap`), but in those cases I think that should be documented in addition to a statement like "this operation is generally fast". The complexity should also be tackled more holistically in the module-level documentation.
Concretely, I'd prefer that we give a more exact estimate of the complexity in the cases (like `String::insert_str`) where we know much more realistically what the cost is. That leads to fewer surprises for the user when `String::insert_str` suddenly takes a really long time, and doesn't really need Big-O. I sense that there's a lot of resentment towards the idea of not including Big-O bounds here (though I neither agree with nor understand that resentment; the goal should be clarity), so I'm fine with them staying in, but at the very least they should then also include important non-constant factors (like `k` and `idx` as outlined in the original issue).