Chapel: Version of Random.choice() that chooses from indices rather than array elements

Created on 7 Feb 2020 · 28Comments · Source: chapel-lang/chapel

As a Chapel user, I might like a version of the choice() routine from the Random module that returns indices from a range lo..hi (or 1D domain {lo..hi}?) rather than from the elements of an array. For instance, this could be useful in populating an m-element array with a subset of random indices from an n-element array (essentially a permutation of a random subset of the array's indices).

Looking at the implementation, it appears that this might be fairly easy by replacing code like return A[randIdx]; with simply return randIdx; (if one were to do a code clone) or by making the helper functions that implement choice() a bit more generic.

Libraries / Modules easy / straightforward Feature Request user issue

Source

bradcray

Most helpful comment

So we will keep the older implementation for large N?

I think so. I would propose the following next steps, unless @e-kayrakli or @bradcray objects:

Implement the algorithm above and do a small experiment to confirm the randomness is sound
- One simple check is to generate large number of small samples for a small N and confirm we get a roughly equal number of each possible outcome
Check if K < log(N) -- if not, use the existing algorithm, else use the new algorithm described above.

ben-albrecht on 17 Apr 2020

👍2

All 28 comments

Hey @bradcray, I would like to work on this.

Can you please provide a sample test case of how do we currently use Random.choice()?

I tried something like this

use Random;
var A2 = [-1.1, -2.2, -3.3, -4.4, -5.5];
writeln(choice(A2));

But it didn't worked.

Thanks!

Aniket21mathur on 7 Feb 2020

Hi @Aniket21mathur: I don't actually know a lot about choice() myself and was pointed to it when asking how to fill an m-element array with unique random indices from 1..n where n >> m.

In a situation like this, I'd use git grep to find tests in $CHPL_HOME/test that call choice(). In practice, most of the standard package and module libraries like this tend to have tests in the library/ subdirectory. So:

$  cd $CHPL_HOME/test/library
$ git grep -l 'choice('
packages/Sort/RadixSort/ips4o-like.chpl
standard/Random/RandomStreamInterface/choiceTest.chpl
standard/Random/RandomStreamInterface/choiceTestErrors.chpl

I'd suggest looking at those tests to see if they help describe how it works beyond what's given in the documentation (which is here: https://chapel-lang.org/docs/master/modules/standard/Random/PCGRandom.html#PCGRandom.PCGRandomStream.choice)

bradcray on 8 Feb 2020

@Aniket21mathur I played some more with choice(), and this code seems to work:

use Random;

var rng = new owned RandomStream( eltType= real, seed= 200 );
//var rng = new owned RandomStream( eltType= int, seed= 200 );   // this also works but not sure about details...

var arr = ["a", "b", "c"];

var n = 20;

writeln();
for i in 1..n do writeln( rng.choice( arr ) );

writeln();
for i in 1..n do writeln( rng.choice( arr, size= 3 ) );

writeln();
for i in 1..n do writeln( rng.choice( arr, size= 3, replace= false ) );

which gives

b
a
...
c
b
b

a b b
c b c
b c a
...
a c b
c a a
b b c

a c b
c b a
c a b
...
b c a
b c a
c b a

For the above results, I guess these routines may be related (by going through Random.chpl)
https://github.com/chapel-lang/chapel/blob/master/modules/standard/Random.chpl#L972
https://github.com/chapel-lang/chapel/blob/master/modules/standard/Random.chpl#L232

ty1027 on 8 Feb 2020

@bradcray @ty1027 thanks for helping. I have proposed a fix in #14871. Please review.

Aniket21mathur on 9 Feb 2020

@bradcray - before I start reviewing #14871, I'd like to clarify a couple of things about the feature request:

I assumed the request here was for a choice() overload that accepts ranges and domains in addition to the current array support - is that correct?
Assuming (1), is it acceptable for the implementation to create an array containing all of the indices of said range or domain, or is avoiding the creation of the sample space array a requirement here?

ben-albrecht on 11 Feb 2020

I assumed the request here was for a choice() overload that accepts ranges and domains in addition to the current array support - is that correct?

Yes. Though even one that just accepted ranges would be sufficient for my motivating needs.

is it acceptable for the implementation to create an array containing all of the indices of said range or domain, or is avoiding the creation of the sample space array a requirement here?

That would be acceptable, but regrettable I think, since it could require a lot of space for something that isn't really necessary.

bradcray on 11 Feb 2020

Yes. Though even one that just accepted ranges would be sufficient for my motivating needs.

OK.

That would be acceptable, but regrettable I think, since it could require a lot of space for something that isn't really necessary.

Agreed. Though the array-based solution is what I had in mind when considering this "easy/straightforward". A non-array-based solution will require most of choice to be re-implemented. Do you have a strong preference for how @Aniket21mathur's PR tackles this initially?

ben-albrecht on 11 Feb 2020

Is it a question of reimplementation, or just rewiring? Without knowing it well, my thinking was that the current implementation rolled a bunch of random indices from within an array's domain and then used them to index into the array. I was imagining making this implementation slightly more generic s.t. it would return the index directly rather than using it to index into the array to support this new behavior. It's not a 2-line change, but it seems like it might be a 20-line change?

bradcray on 11 Feb 2020

I was imagining making this implementation slightly more generic s.t. it would return the index directly rather than using it to index into the array to support this new behavior

@bradcray thanks for clarifying, what I understood before was that there is a requirement of a new function that just returns indices instead of values(so are my changes that followed in #14871 ).

I wanted to clear it again, we want changes to the same function? or design a new function with the following features? (as far as I understand):

Return a single index on not specifying size instead of the value at that index.
Return a couple of indices to cover a range depending on the size specified.

Aniket21mathur on 11 Feb 2020

Sorry to be unclear. I think of the current implementation (Ben correct me if I'm wrong) as being something like this:

1) given an array
2) query its domain
3) generate random indices from that domain
4) use those indices to index into the array and choose elements from it
5) store/return/whatever those elements

So I'm imagining the new implementation as being:

a) given a range or domain (if a range, we could convert it into a 1D domain)
b) generate random indices from that domain
c) store/return/whatever those indices

So I view steps (a-b) to be like steps (2-3) and step (c) to be like step (5) and was hoping that the guts of the two interfaces could be the same and that the only difference would be whether the domain was queried from the array on the front side and whether we generated indices or array elements (by indexing using the indices) in determining what was stored/returned/whatever. But that was just my hope from a 10,000 ft level.

bradcray on 11 Feb 2020

@bradcray - that's generally correct, though there are some additional complexities in handling the weighted sampling and sampling without replacement cases.

b) generate random indices from that domain

As a note to the implementer -- I think we'll have to generate a random number between 1..domain.size and use domain.dim(1).indexOrder() to get the i-th value of the domain.

Overall, I think it'd be great to:

add choice(arr: range, ...) and choice(arr: domain, ...) overloads. (maybe we want to rename the argument to something more generic, like X).
have choice(arr: [], ...) and choice(arr: range, ...) utilize the domain overload under the hood

ben-albrecht on 11 Feb 2020

I think we'll have to generate a random number between 1..domain.size and use domain.dim(1).indexOrder() to get the i-th value of the domain.

Does the current approach not already do something like this? How does it choose a random array element if it does not?

bradcray on 11 Feb 2020

We currently leverage the reindex method:

ref A = arr.reindex(1..arr.size);
var randIdx = stream.getNext(resultType=int, 1, A.size);

[[source](https://github.com/chapel-lang/chapel/blob/master/modules/standard/Random.chpl#L277)]

ben-albrecht on 11 Feb 2020

I see... I don't think domains support reindexing at present...(?) So that might suggest using the indexOrder() approach instead. Maybe I was too quick to add the easy / straightforward label to this...

bradcray on 11 Feb 2020

So that might suggest using the indexOrder() approach instead. Maybe I was too quick to add the easy / straightforward label to this...

Yes, agreed. From here we could:

1) Allow a contributor to do the easy/straightforward approach of creating temporary arrays, saving the harder implementation as future work
2) Try to guide the contributor through the more ideal non-temp-array implementation

@Aniket21mathur - do you have a preference for your current PR?

ben-albrecht on 11 Feb 2020

do you have a preference for your current PR?

@ben-albrecht thanks for asking.

Currently, in my pr, I have implemented a new function choiceIndex that simply returns indices in place of array elements. I have a test case here.

I went through your and @bradcray's discussion, the exact implementation that we want is still unclear to me, sorry for that. So it would be great if you and @bradcray decide the implementation and guide me through. I am happy to contribute :smile:

Aniket21mathur on 12 Feb 2020

@Aniket21mathur - I chatted with @bradcray offline and we determined pursuing (2) the more performant implementation would be preferred.

Assuming this non-straightforward task hasn't scared you away 😄, you'll want to do the following for your PR:

rename arr arg to X
add choice(X: range, ...) and choice(X: domain, ...) overloads.
modify the existing _choice(X: [], ...) implementation to expect a domain instead (_choice(X: domain, ...)
- Update the algorithm to support a domain instead of array (this is where most the work lies)
- Because we cannot use reindex, we will have to use the approach mentioned earlier
In choice(X: [], ...) and choice(X: range, ...), create a domain from the array or range and pass that (and all other defined args) to the _choice(X: domain, ...) overload.
- We use a domain as the base implementation rather than a range, to make supporting multidimensional sampling easier in the future (I would not worry about implementing that for this task, however).

Let me know if you have any questions. We can iterate further on the draft PR review as well.

ben-albrecht on 13 Feb 2020

Let me know if you have any questions. We can iterate further on the draft PR review as well.

@ben-albrecht I will go through the current implementation and will try to understand it properly before making any further changes. I will let you know if I have any queries.

Thanks!

Aniket21mathur on 13 Feb 2020

👍1

@ben-albrecht @bradcray I have some questions please clarify -

We have a prob argument passed to _choice that must be array, should we also change that argument type to domain?
Going through the implementation of _choice I see that first we are generating a reindexed reference A from 1..size,

ref A = arr.reindex(1..arr.size);

then randomly selecting elements from A and returning them. Though I am not able to understand the functionality of domain.dim(1). Using it as

var Days : domain(int) = {1, 2,3};
Days.dim(1);

throws an error

paral.chpl:4: error: unresolved call 'unmanaged DefaultAssociativeDom(int(64),true).dsiDim(1)'
$CHPL_HOME/modules/internal/DefaultRectangular.chpl:574: note: this candidate did not match: DefaultRectangularDom.dsiDim(d: int)

Thanks!

Aniket21mathur on 8 Mar 2020

Hi @Aniket21mathur -

We have a prob argument passed to _choice that must be array, should we also change that argument type to domain?

I think that should still be an array, since each value is associated with a probability.

Going through the implementation of _choice I see that first we are generating a reindexed reference A from 1..size,

Yes, this is to handle cases where a user passes in an array with a different domain, e.g. {2..size+1} or {0..#size}.

Though I am not able to understand the functionality of domain.dim(1)

Your example creates an associative array which does not have dimensions, hence the error.
See docs on domain.dim for more info.

A working example would be:

var days = {1..3};
writeln(days.dim(1));

ben-albrecht on 9 Mar 2020

Our current implementation of sampling K elements without replacement creates a temporary array, A: [1..N] int, shuffles it, and selects the first K elements as the indices for the sample.

This has space and time complexity of _O(N)_.

To avoid _O(N)_ space complexity, we could generate random numbers within 1..N and track them in a set to confirm no duplicates. This would only be worth it when K << N though, so we might want to have some check that enables this implementation only if K is sufficiently small relative to N. I am also not sure if this approach maintains randomness of selection. We would probably want to do some testing to confirm this.

@bradcray - do you have any thoughts on this?

(Also pinging @e-kayrakli as we were chatting about this offline)

ben-albrecht on 17 Apr 2020

It looks like K < log(N) as Ben suggested offline produces less than 1% clashes with generated numbers. Without worrying much about the complexity implications of that, I think the number is safe enough that our time complexity wouldn't blow through the roof because of re-generating unique random number.

See https://en.wikipedia.org/wiki/Birthday_problem
For the lazy 1-scipy.misc.factorial(k)*scipy.special.binom(n, k)/n**k is the scipy way of calculating the possibility of a clash

e-kayrakli on 17 Apr 2020

👍1

This would only be worth it when K << N though

Yes, agreed. So we will keep the older implementation for large N?

Aniket21mathur on 17 Apr 2020

So we will keep the older implementation for large N?

I think so. I would propose the following next steps, unless @e-kayrakli or @bradcray objects:

Implement the algorithm above and do a small experiment to confirm the randomness is sound
- One simple check is to generate large number of small samples for a small N and confirm we get a roughly equal number of each possible outcome
Check if K < log(N) -- if not, use the existing algorithm, else use the new algorithm described above.

ben-albrecht on 17 Apr 2020

👍2

From @Aniket21mathur in Gitter:

just want to confirm the logarithm we discussed here (https://github.com/chapel-lang/chapel/issues/14865#issuecomment-615376672) is the natural logarithm right (to the base e)?

I am not sure what @e-kayrakli tested in his comment above, but I always assume log base 2 in CS contexts unless otherwise stated. This check can be tuned more precisely in the future, but this seems like a reasonable starting point.

ben-albrecht on 20 Apr 2020

👍1

I am not sure what @e-kayrakli tested in his comment above, but I always assume log base 2 in CS contexts unless otherwise stated. This check can be tuned more precisely in the future, but this seems like a reasonable starting point.

Thanks. I will start with base 2 :-)

Aniket21mathur on 20 Apr 2020

👍1

Sorry to come late to the conversation, though I don't know if I have much of interest to add to the points above:

creating an n-element array when only k elements are needed and k << n seems unfortunate and worth avoiding when possible.
I'm personally unclear on whether choice() should generate unique numbers or not. I could imagine cases where it would be great if it didn't, but I understand that that adds expense. Assuming that there is prior art that the original choice() was based on, I think we should follow their lead. (And if the answer is "no duplicates", I don't think we should count on statistical unlikelihood to keep us safe, but should ensure that there aren't any).

bradcray on 22 Apr 2020

'm personally unclear on whether choice() should generate unique numbers or not. I could imagine cases where it would be great if it didn't, but I understand that that adds expense. Assuming that there is prior art that the original choice() was based on, I think we should follow their lead. (And if the answer is "no duplicates", I don't think we should count on statistical unlikelihood to keep us safe, but should ensure that there aren't any).

@bradcray the original choice function takes, replace as an argument to decide whether the elements are repeated or not, going through the way @ben-albrecht @e-kayrakli suggested saves us some memory, in case k << N.

Aniket21mathur on 22 Apr 2020

Was this page helpful?

0 / 5 - 0 ratings

Related issues

Internal error when using isSubType on generic type

ben-albrecht · 3Comments

Unable to resolve type when using type alias

BryantLam · 3Comments

Implementation of function like `codepointToString` for bytes

Aniket21mathur · 3Comments

CG: when to resolve an interface declaration?

vasslitvinov · 3Comments

How do you add a package to Mason that requires changes to the Chapel run-time/compiler?

LouisJenkinsCS · 3Comments