Turicreate: SArray should have an implementation of median

Created on 25 Jun 2018  路  4Comments  路  Source: apple/turicreate

A approx/exact median is very useful for SArrays

sa.median(approx = False) # default
sa.median(approx = True)  # uses the sketch summary to compute the median
SFrame engine enhancement

Most helpful comment

@TobyRoseman We should do exactly that in the C++ layer :). I think the code is just as easy and we can also put it in the C-APIs.

All 4 comments

I like this idea. It's pretty simple to do efficiently at the Python layer. As a proof of concept, the following works:

def median(sa, approx = False):
    if not approx:
        temp = tc.SFrame({'x': sa})
        temp['dummy'] = 1
        result = temp.groupby('dummy', {'y': tc.aggregate.QUANTILE('x', 0.5)})
        return result['y'][0][0]
    else:
        sketch_summary = tc.Sketch(sa)
        return sketch_summary.quantile(.5)

@TobyRoseman We should do exactly that in the C++ layer :). I think the code is just as easy and we can also put it in the C-APIs.

The code I shared previously is not going to work when approx = False. It turns out we can not use tc.aggregate.QUANTILE if we want exact answers.

The docstring for tc.aggregate.QUANTILE contains this surprising information:

The returned quantiles are guaranteed to have 0.5% accuracy. That is to say,
if the requested quantile is 0.50, the resultant quantile value may be
between 0.495 and 0.505 of the true quantile.

Let's use the sketch for getting the exact method by using the bound guarantees of the sketch. Because it guarantees the value to be within eps * N of the original value, we set eps based on memory (it's pretty conservative, so that can be set small), then get the quantiles of the data at (0.5 - eps) *N and (0.5 + eps) *N. Call these values a and b. We then know that the true median lies between a and b.

Then, we do a second pass on the data using an apply:

atomic<size_t> n_below_a = 0; 
std::vector<flexible_type> candidates;  // could be replaced with internal lock free stack.
std::mutex candidate_lock; 

auto count_median = [&](flexible_type x) { 
   if(x < a) { ++n_below_a; }
   else if(x <= b) {
      std::lock_guard<std::mutex> lg(candidate_lock); 
      candidates.push_back(x);
   }
}); 
data.apply(count_median);   // May be reduce_with_lambda or something like that. 

Now, what's the 50% mark?  
size_t inner_n = (data.size() / 2 - n_below_a);   // do even-vs-odd etc.
// Linear time partition
std::nth_element(candidates.begin(), candidates.begin() + inner_n, candidates.end()); 
// median is now at inner_n
return candidates[inner_n];
Was this page helpful?
0 / 5 - 0 ratings