Crystal: [Proposal] Array stable sort

Created on 3 May 2018  路  22Comments  路  Source: crystal-lang/crystal

It may be useful to have the option to stable sort arrays.

I'm thinking about adding it as a flag in Array#sort, defaulting to false, signature looks like this:

def sort(stable = false) : Array(T)

Therefore if needed, it can provide stable sorting by calling:
[1,...,-1].sort(stable: true)

Another option is of course to have a separate method (like in c++ or golang):
[1,...,-1].stable_sort

Tbh I find the version with parameter a little bit better - if doesn't add a new method to array public interface and doesn't break anything (as it defaults to false).

In spirit if this comment: https://github.com/crystal-lang/crystal/issues/2350#issuecomment-261742027

I would be very happy to implement either of this versions.

in-progress feature stdlib

Most helpful comment

I don't see a reason why you would need support for multiple sorting algorithms. One that is stable and one that is more performant should be plenty for the stdlib.

All 22 comments

I thought our current algorthm was fast and stable?

Doesn't seem so:

class Foo
  include Comparable(Foo)

  @@id = 0

  @id : Int32

  def initialize(@x : Int32)
    @id = @@id
    @@id += 1
  end

  def <=>(other : Foo)
    @x <=> other.@x
  end
end

foos = Array.new(100) { Foo.new(rand(1..2)) }

pp foos

foos.sort!

pp foos

All @ids with the same @x value should be in increasing order, but they are not.

@RX14 - the whole point is, that it isn't:

a = (1..17).to_a
puts a == a.sort { 0 } # => false

(example taken from https://github.com/crystal-lang/crystal/issues/2350#issuecomment-261704788, tested on 0.24.2)

I stand corrected! What stable sorting algorithms give the best generic results? We have introsort for our standard sort, do hybrid algorithms win in stable sorting too?

Looks like timsort is the one to go for.

Ok, so I will start working on timsort here.

Can you add an in-progress label?
Which version would be better - one with extra argument provided to sort or one with the new method?

Is there any downsides to just having one sort that's stable, given the performance is comparable? How does timsort compare to introsort?

Timsort got worst space complexity equals to n, while introsort got log n (due to recursion).

Also, there is a reason why C++, or golang have separate algorithms for stable or normal sort. When stability isn't required it's better to have better space complexity.

For the calling style, let's go with a stable: true required named argument.

@asterite you mean always require an explicit stable: true or stable: false? I don't think that's neccesary here...

I mean, with a default value of false

A default value of false would be great - it wouldn't break existing api.

I think the best would be to create an enum item for all possible sort algorithms. So later it could be expanded as needs be. Always adding a new bool parameter seems ugly.

I don't see a reason why you would need support for multiple sorting algorithms. One that is stable and one that is more performant should be plenty for the stdlib.

It has sometimes surprised me that the current sort method seems to "rearrange" the order of equal elements. It would be really nice if stable were the default (or you could go the java route, stable for objects, unstable default for primitive arrays, since it doesn't matter if they get swapped). Java gives their rationale for it here https://stackoverflow.com/a/15154287/32453 "Stability is a big deal when sorting arbitrary objects. For example, suppose you have objects representing email messages, and you sort them first by date, then by sender. You expect them to be sorted by date within each sender, but that will only be true if the sort is stable." but having an unstable option would be nice too. The latest implementations for the JDK seem to be "Timsort" for stable and "dual pivot quicksort" for unstable: https://stackoverflow.com/questions/32334319/why-does-collections-sort-use-mergesort-but-arrays-sort-does-not it would be interesting to have different types available and be able to call them out by name, like "heapsort" etc. in case they work differently on different input data (though you might only need timsort and dual pivot quicksort typically), and also maybe nice to retain an option to use quicksort on normal objects arrays when speed is still preferred...so either way :)

The only question in my head still is should stable be the default?

It makes sense to me if you want to chain .sort_by(&.x).sort_by(&.y) though I guess it could be written as .sort_by(:stable => true, &.x}.sort_by(:stable => true, &.y} but it feels to me as if stable should be the default, otherwise what does sort_by even mean when chained? So I kind of like the java way of stable by default.

(somebody once proposed possibly having #sort_by as stable and #sort as unstable, by default, that might be a bit confusing but is an option?: https://www.ruby-forum.com/t/sort-by-is-not-stable/198761). Maybe I'll run a poll on the crystal forum if there's no objection to doing so, in a few days about this question...

@rdp .sort_by { |foo| {foo.y, foo.x} }, seems clearer and faster to me.

Mmm that doesn't seem immediately clear to me...plus you can't as easily chain (or chain later based on conditions, it loses incoming initial ordering, if any.)... :|

If we want to go the extra mile then for unstable sorting of int's (that takes slightly more space) there's radix sort and for strings "burst sort." May as well go for the very fastest, right? Hmm those would be non in place sorts...now things are getting confusing option wise...though I guess we could add an :in_place option later...

@rdp there is unstable inplace and stable but non-inplace versions of radix sort. I've tried to port them to crystal some time ago, and they are pretty fast (though my implementation seems to have some bugs).

OK I am trying to figure out what the best defaults would be:

My current hypothesis is to have unstable be the default only if it is an Array of primitive type (ints, floats, Strings) that are sorted using the standard <=> comparator.

This way if developers new up an array of ints, and call sort on it, they get the fast algorithm by default (which is how java does it, primitives use an unstable sort). The theory being that "primitives" are "inter changeable" so it doesn't affect the output if, for instance, various 3's happen to get all mixed out of their initial order, it has the same final outcome as if the sort had been stable.

For "normal objects" (using the default comparator <=>) it should default to "stable sort" so that previous ordering isn't lost (like how Python and Java do, for that reason).

It seems, if blocks are used, even for Arrays of primitives, that stable sorting is preferred.

ex: if I want to put "all 3's" at the front of an array [lame example, but still reasonable]:

(1..17).to_a.sort_by{ |i| i == 3 ? -1 : 1 }

I'd expect the output to retain incoming order for equal elements, ex: [3,1,2,4,5,6, ...] , so it seems to me the default when using a block should be stable sort, as well.

In terms of actually introducing the modified behavior, in my mind if the defaults are appropriate, and overridable, maybe just do it with a release note warning breaking change, specifyunstable: trueif you need the old behavior? Is that enough?

I suppose we could first do a release with unstable: true everywhere and then warn everyone to explicitly call it out before the next release (if they depend on unstable sorts?), then change the defaults to the above (stable) with some subsequent release.

But even though sorts today are using unstable, people will still get the same "sorting order" (but not exact order) they were specifying with new stable sort defaults, so I don't think there is much, if any, dependence on unstable sorts [?].

But if you prefer to (in essence) deprecate it first (unstable: true everywhere for 1 release, then new default) that would be OK.

If no feedback I'll go ahead and move forward with a PR that includes the defaults above [and also uses merge sort for now for the stable sort--somebody could implement the more complicated Timsort or Quadsort to optimize it at some point, I'm mostly focused on getting the API and behavior right, at this point].

Thanks!

Sounds good!

I don't think we need an expensive deprecation cycle. The sorting behaviour has so far not been specified, so the results just arbitrarily reflect implementation details. Maybe someone depends on the specifics, but I doubt there's a huge impact if we change it. Maybe some spec values need to be reordered.

Was this page helpful?
0 / 5 - 0 ratings