As a Chapel user, I might like a version of the choice() routine from the Random module that returns indices from a range lo..hi (or 1D domain {lo..hi}?) rather than from the elements of an array. For instance, this could be useful in populating an m-element array with a subset of random indices from an n-element array (essentially a permutation of a random subset of the array's indices).
Looking at the implementation, it appears that this might be fairly easy by replacing code like return A[randIdx]; with simply return randIdx; (if one were to do a code clone) or by making the helper functions that implement choice() a bit more generic.
Hey @bradcray, I would like to work on this.
Can you please provide a sample test case of how do we currently use Random.choice()?
I tried something like this
use Random;
var A2 = [-1.1, -2.2, -3.3, -4.4, -5.5];
writeln(choice(A2));
But it didn't worked.
Thanks!
Hi @Aniket21mathur: I don't actually know a lot about choice() myself and was pointed to it when asking how to fill an m-element array with unique random indices from 1..n where n >> m.
In a situation like this, I'd use git grep to find tests in $CHPL_HOME/test that call choice(). In practice, most of the standard package and module libraries like this tend to have tests in the library/ subdirectory. So:
$ cd $CHPL_HOME/test/library
$ git grep -l 'choice('
packages/Sort/RadixSort/ips4o-like.chpl
standard/Random/RandomStreamInterface/choiceTest.chpl
standard/Random/RandomStreamInterface/choiceTestErrors.chpl
I'd suggest looking at those tests to see if they help describe how it works beyond what's given in the documentation (which is here: https://chapel-lang.org/docs/master/modules/standard/Random/PCGRandom.html#PCGRandom.PCGRandomStream.choice)
@Aniket21mathur I played some more with choice(), and this code seems to work:
use Random;
var rng = new owned RandomStream( eltType= real, seed= 200 );
//var rng = new owned RandomStream( eltType= int, seed= 200 ); // this also works but not sure about details...
var arr = ["a", "b", "c"];
var n = 20;
writeln();
for i in 1..n do writeln( rng.choice( arr ) );
writeln();
for i in 1..n do writeln( rng.choice( arr, size= 3 ) );
writeln();
for i in 1..n do writeln( rng.choice( arr, size= 3, replace= false ) );
which gives
b
a
...
c
b
b
a b b
c b c
b c a
...
a c b
c a a
b b c
a c b
c b a
c a b
...
b c a
b c a
c b a
For the above results, I guess these routines may be related (by going through Random.chpl)
https://github.com/chapel-lang/chapel/blob/master/modules/standard/Random.chpl#L972
https://github.com/chapel-lang/chapel/blob/master/modules/standard/Random.chpl#L232
@bradcray @ty1027 thanks for helping. I have proposed a fix in #14871. Please review.
@bradcray - before I start reviewing #14871, I'd like to clarify a couple of things about the feature request:
choice() overload that accepts ranges and domains in addition to the current array support - is that correct?I assumed the request here was for a choice() overload that accepts ranges and domains in addition to the current array support - is that correct?
Yes. Though even one that just accepted ranges would be sufficient for my motivating needs.
is it acceptable for the implementation to create an array containing all of the indices of said range or domain, or is avoiding the creation of the sample space array a requirement here?
That would be acceptable, but regrettable I think, since it could require a lot of space for something that isn't really necessary.
Yes. Though even one that just accepted ranges would be sufficient for my motivating needs.
OK.
That would be acceptable, but regrettable I think, since it could require a lot of space for something that isn't really necessary.
Agreed. Though the array-based solution is what I had in mind when considering this "easy/straightforward". A non-array-based solution will require most of choice to be re-implemented. Do you have a strong preference for how @Aniket21mathur's PR tackles this initially?
Is it a question of reimplementation, or just rewiring? Without knowing it well, my thinking was that the current implementation rolled a bunch of random indices from within an array's domain and then used them to index into the array. I was imagining making this implementation slightly more generic s.t. it would return the index directly rather than using it to index into the array to support this new behavior. It's not a 2-line change, but it seems like it might be a 20-line change?
I was imagining making this implementation slightly more generic s.t. it would return the index directly rather than using it to index into the array to support this new behavior
@bradcray thanks for clarifying, what I understood before was that there is a requirement of a new function that just returns indices instead of values(so are my changes that followed in #14871 ).
I wanted to clear it again, we want changes to the same function? or design a new function with the following features? (as far as I understand):
size instead of the value at that index.size specified.Sorry to be unclear. I think of the current implementation (Ben correct me if I'm wrong) as being something like this:
1) given an array
2) query its domain
3) generate random indices from that domain
4) use those indices to index into the array and choose elements from it
5) store/return/whatever those elements
So I'm imagining the new implementation as being:
a) given a range or domain (if a range, we could convert it into a 1D domain)
b) generate random indices from that domain
c) store/return/whatever those indices
So I view steps (a-b) to be like steps (2-3) and step (c) to be like step (5) and was hoping that the guts of the two interfaces could be the same and that the only difference would be whether the domain was queried from the array on the front side and whether we generated indices or array elements (by indexing using the indices) in determining what was stored/returned/whatever. But that was just my hope from a 10,000 ft level.
@bradcray - that's generally correct, though there are some additional complexities in handling the weighted sampling and sampling without replacement cases.
b) generate random indices from that domain
As a note to the implementer -- I think we'll have to generate a random number between 1..domain.size and use domain.dim(1).indexOrder() to get the i-th value of the domain.
Overall, I think it'd be great to:
choice(arr: range, ...) and choice(arr: domain, ...) overloads. (maybe we want to rename the argument to something more generic, like X).choice(arr: [], ...) and choice(arr: range, ...) utilize the domain overload under the hoodI think we'll have to generate a random number between 1..domain.size and use domain.dim(1).indexOrder() to get the i-th value of the domain.
Does the current approach not already do something like this? How does it choose a random array element if it does not?
We currently leverage the reindex method:
ref A = arr.reindex(1..arr.size);
var randIdx = stream.getNext(resultType=int, 1, A.size);
[[source](https://github.com/chapel-lang/chapel/blob/master/modules/standard/Random.chpl#L277)]
I see... I don't think domains support reindexing at present...(?) So that might suggest using the indexOrder() approach instead. Maybe I was too quick to add the easy / straightforward label to this...
So that might suggest using the indexOrder() approach instead. Maybe I was too quick to add the easy / straightforward label to this...
Yes, agreed. From here we could:
1) Allow a contributor to do the easy/straightforward approach of creating temporary arrays, saving the harder implementation as future work
2) Try to guide the contributor through the more ideal non-temp-array implementation
@Aniket21mathur - do you have a preference for your current PR?
do you have a preference for your current PR?
@ben-albrecht thanks for asking.
Currently, in my pr, I have implemented a new function choiceIndex that simply returns indices in place of array elements. I have a test case here.
I went through your and @bradcray's discussion, the exact implementation that we want is still unclear to me, sorry for that. So it would be great if you and @bradcray decide the implementation and guide me through. I am happy to contribute :smile:
@Aniket21mathur - I chatted with @bradcray offline and we determined pursuing (2) the more performant implementation would be preferred.
Assuming this non-straightforward task hasn't scared you away 馃槃, you'll want to do the following for your PR:
arr arg to Xchoice(X: range, ...) and choice(X: domain, ...) overloads._choice(X: [], ...) implementation to expect a domain instead (_choice(X: domain, ...)reindex, we will have to use the approach mentioned earlierchoice(X: [], ...) and choice(X: range, ...), create a domain from the array or range and pass that (and all other defined args) to the _choice(X: domain, ...) overload.Let me know if you have any questions. We can iterate further on the draft PR review as well.
Let me know if you have any questions. We can iterate further on the draft PR review as well.
@ben-albrecht I will go through the current implementation and will try to understand it properly before making any further changes. I will let you know if I have any queries.
Thanks!
@ben-albrecht @bradcray I have some questions please clarify -
We have a prob argument passed to _choice that must be array, should we also change that argument type to domain?
Going through the implementation of _choice I see that first we are generating a reindexed reference A from 1..size,
ref A = arr.reindex(1..arr.size);
then randomly selecting elements from A and returning them. Though I am not able to understand the functionality of domain.dim(1). Using it as
var Days : domain(int) = {1, 2,3};
Days.dim(1);
throws an error
paral.chpl:4: error: unresolved call 'unmanaged DefaultAssociativeDom(int(64),true).dsiDim(1)'
$CHPL_HOME/modules/internal/DefaultRectangular.chpl:574: note: this candidate did not match: DefaultRectangularDom.dsiDim(d: int)
Thanks!
Hi @Aniket21mathur -
We have a prob argument passed to _choice that must be array, should we also change that argument type to domain?
I think that should still be an array, since each value is associated with a probability.
Going through the implementation of _choice I see that first we are generating a reindexed reference A from
1..size,
Yes, this is to handle cases where a user passes in an array with a different domain, e.g. {2..size+1} or {0..#size}.
Though I am not able to understand the functionality of domain.dim(1)
Your example creates an associative array which does not have dimensions, hence the error.
See docs on domain.dim for more info.
A working example would be:
var days = {1..3};
writeln(days.dim(1));
Our current implementation of sampling K elements without replacement creates a temporary array, A: [1..N] int, shuffles it, and selects the first K elements as the indices for the sample.
This has space and time complexity of _O(N)_.
To avoid _O(N)_ space complexity, we could generate random numbers within 1..N and track them in a set to confirm no duplicates. This would only be worth it when K << N though, so we might want to have some check that enables this implementation only if K is sufficiently small relative to N. I am also not sure if this approach maintains randomness of selection. We would probably want to do some testing to confirm this.
@bradcray - do you have any thoughts on this?
(Also pinging @e-kayrakli as we were chatting about this offline)
It looks like K < log(N) as Ben suggested offline produces less than 1% clashes with generated numbers. Without worrying much about the complexity implications of that, I think the number is safe enough that our time complexity wouldn't blow through the roof because of re-generating unique random number.
See https://en.wikipedia.org/wiki/Birthday_problem
For the lazy 1-scipy.misc.factorial(k)*scipy.special.binom(n, k)/n**k is the scipy way of calculating the possibility of a clash
This would only be worth it when
K << Nthough
Yes, agreed. So we will keep the older implementation for large N?
So we will keep the older implementation for large N?
I think so. I would propose the following next steps, unless @e-kayrakli or @bradcray objects:
K < log(N) -- if not, use the existing algorithm, else use the new algorithm described above.From @Aniket21mathur in Gitter:
just want to confirm the logarithm we discussed here (https://github.com/chapel-lang/chapel/issues/14865#issuecomment-615376672) is the natural logarithm right (to the base e)?
I am not sure what @e-kayrakli tested in his comment above, but I always assume log base 2 in CS contexts unless otherwise stated. This check can be tuned more precisely in the future, but this seems like a reasonable starting point.
I am not sure what @e-kayrakli tested in his comment above, but I always assume log base 2 in CS contexts unless otherwise stated. This check can be tuned more precisely in the future, but this seems like a reasonable starting point.
Thanks. I will start with base 2 :-)
Sorry to come late to the conversation, though I don't know if I have much of interest to add to the points above:
- 'm personally unclear on whether choice() should generate unique numbers or not. I could imagine cases where it would be great if it didn't, but I understand that that adds expense. Assuming that there is prior art that the original choice() was based on, I think we should follow their lead. (And if the answer is "no duplicates", I don't think we should count on statistical unlikelihood to keep us safe, but should ensure that there aren't any).
@bradcray the original choice function takes, replace as an argument to decide whether the elements are repeated or not, going through the way @ben-albrecht @e-kayrakli suggested saves us some memory, in case k << N.
Most helpful comment
I think so. I would propose the following next steps, unless @e-kayrakli or @bradcray objects:
K < log(N)-- if not, use the existing algorithm, else use the new algorithm described above.