Runtime: BitArray should include shift operators

Created on 20 May 2016  路  43Comments  路  Source: dotnet/runtime

As far as I can tell, BitArray does not currently support shift operators. These would be really useful if they could be implemented efficiently.

API Proposal

For details see https://github.com/dotnet/corefx/issues/8719#issuecomment-221659151 and https://github.com/dotnet/corefx/issues/8719#issuecomment-245046251

    public sealed partial class BitArray : System.Collections.ICollection, System.Collections.IEnumerable
    {
        ...
+       public System.Collections.BitArray RightShift(int count);
+       public System.Collections.BitArray LeftShift(int count);
        ...
    }
api-approved area-System.Collections up-for-grabs

Most helpful comment

It'd be a bit odd to have the operator overloads for the shifts but not for the other binary operations i.e. not/or/and.

All 43 comments

I think this would need an API review, for which I propose this implementation: jasonwilliams200OK@72f070c

Thanks for the code review @Clockwork-Muse. I have addressed the issues and updated the SHA-1.

@joshfree, since this is going to introduce new methods on API surface of the System.Collections contract, should this have the API review label? If API review is not required then I prepare a pull request.

Your tests don't even compile, much less the implementation passing. I've left some other notes in your current work for future consideration, but I'd actually use these tests.
Tests were validated against a brain-dead implementation, not what we'd actually want to ship.

(full disclosure, I couldn't get your changes in BitArray to compile due to ref issues, but there are ways around that for testing...)

Note that there are a couple of pitfalls for the unwary due to the existing implementation of a few methods (Not, SetAll, Length) and constructor. I'm not sure _where_ the fix would go for that - whether in the offending method or in RightShift, or whether that's "by design", but it's something to watch out for (and a reason for API Review, certainly).

There was a typo when I rebased and edited few lines. Rests of your comments are pretty subjective and your tests seem overly complicated. But anyway i don't see its worth bike-shedding at this point.

If I'm reading this right this effectively loops over all bits and just sets them again (left/right shifted).

While I can't offer an implementation, shouldn't this rather be implemented as proper shifts on the underlying bytes in the array (accounting for overflows etc.)?

In principle, something like:

C# (((bytes[3] << 24 | bytes[2] << 16 | bytes[1] << 8 | bytes[0]) << n)

@chrisaut, BitArray is internally storing values in int array (probably for efficiency?). The getter, setters and index operator on the instance take bit index and map it to corresponding location in int array. I used Get/Set to avoid duplicating that logic. Obviously this is just a proposal and there might be more efficient way to implement it even with the current impl. Or maybe the internal impl can be changed to use bool[] if it does not degrade the performance benefits int[] is delivering?

The problem with bool[] is that the processor doesn't deal with single bits, so it has to store each bool as something else, probably an int. Even if it can store it in a byte, that still results in the size getting bumped up by a factor of 8.
In theory, the runtime could compress a bool[] into something else, but there are serious stability and concurrency issues.

Yes, there is a more efficient version (I think I've roughly figured it out, but haven't done any work on it), but care has to be taken with RightShift to zero out the top bits.

Yes it stores as a int array (not byte array...I misread). But my point still stands...I don't think this should be implemented naively like presented. Each get and set does divs/rem internally. This seems very inefficient. Eg shifting a bitarray length n does n*4 divisions/rem plus all the other stuff.

There are a couple of different discussions going on in this thread:

  1. The method of storing the bits in the BitArray

    • It is somewhat odd to me that BitArray maintains an integer array, I would have expected it (based on the API) to contain either a bool array or an array of bytes that each effectively contained 8 "bits" that were near each other in index, sort of like a bucket containing a slice of the array. Like yall have already mentioned, it's probably safe to assume it's for efficiency/upsizing issues.

    • While this isn't completely out of the realm of possibility to change, we would need to have some really convincing perf justification for doing so. We'd also need to be sure our tests were _very_ comprehensive.

  2. Naive or in-depth implementation of the new APIs

    • The question here is whether we should use the Getter/Setter methods or directly work on the underlying bit storage device (whatever that may be).

    • In general the answer to this scope of questions is entirely dependent on the perf cost weighted against the source obfuscation cost. Perf is likely going to win out if the gains are more than a few operations or a few allocations, but not if it takes 50 lines to do it.

  3. API for the new methods

    • This is seems like we're in agreement on, no?

     public sealed partial class BitArray : System.Collections.ICollection, System.Collections.IEnumerable
     {
         ...
 +       public System.Collections.BitArray RightShift(int count) { return default(System.Collections.BitArray); }
 +       public System.Collections.BitArray LeftShift(int count) { return default(System.Collections.BitArray); }
         ...
     }

Thanks @ianhays! If int[] is going to stay due to its potential perf benefits, I think we can use an extra private void Move(bool fromIndex, bool toIndex) method with desired bits from Get and Set methods. That will cut off some extra branches. Instead of Set(j, Get(i)); perhaps Move(i, j)?

Has anyone taken a look at BigInteger? Its internal data is also an int array (or maybe a uint array, can't remember), and it supports shift operators. I'd imagine the same shift implementation could be used for BitArray.

@ryantrem - yeah, the implementation of BigInteger's shifts is about what I was thinking of.
It's using uint[] internally, and given this SO question and answer, BitArray should probably also be switched. This is more important we consider that under some circumstances for right shifts the sign bit won't be directly assignable, meaning whether the operation yields set or unset bits would depend entirely on other, previous operations (and thus be mysterious).

This is more important we consider that under some circumstances for right shifts the sign bit won't be directly assignable, meaning whether the operation yields set or unset bits would depend entirely on other, previous operations (and thus be mysterious).

This is a +1 for the naive solution: we wouldn't have to worry about the underlying array type and its corresponding method of sign carrying when shifting. Nothing is hidden and the behavior is plainly apparent (fill with zeroes for shifts).

Beyond that though, an unsigned int would make a lot more sense for shifting, but I'm concerned that there will be unforeseen repercussions to the other BitArray operations. That's a fairly fundamental change to make to the class to add new API.

@ianhays - actually, the sign bit thing is a sideshow in regards to RightShift, since we have to (probably) unset all high bits anyways. Shifting even a single bit requires we unset it, regardless of length.
If you have a BitArray that's only 8 bits long, and you Not it (among several options), the entire int is set; right shifting one bit pulls in another set bit, for what would appear to be mysterious reasons. There's no way to observe what you'll get ahead of time - attempting to lengthen the bit array unsets everything 'new'.

Use of uint should be investigated (although I don't anticipate any effects), but wouldn't otherwise affect the optimized version.

I agree uint (or any unsigned data type for that matter) would be better.

But I don't think the naive "loop over each bit" version should go in, even with a new Move method that only does the index calculation once (instead of set/get each doing it). If it goes in the bcl it should at least be reasonable efficient, the simple loop thing any developer can write by themselves should they need it.

I tried to write a quick implementation, I believe this to be correct for shifting, but I don't have much time right now. It's just to illustrate the idea.

Performance comparison / IL comparison / Numeric figures? You have a branch in the loop. But whichever implementation wins in smarter assembly, triumphs. I want shift operator first hand coming from the type as much as the next guy. Naive or not, build-in type method compared to external consumer implementation (where they don't have access to internal storage) is better.

BTW, nobody is merging anything to BCL code. There is no Pull Request (Google it if you don't know what that means). Your "naive" stereotyping has started to stink now.

@jasonwilliams200OK I'm sorry you think my idea stinks. It seems I offended you somehow. I believe bringing people's head's together before diving into a full on implementation is better. That's all I'm trying to do here. I will ignore your passive aggressive remarks in the hope that this discussion can continue.

I never claimed to offer a pull request. I thought perhaps people thought a non naive implementation would be too much code bloat or to complex, so I tried to see how much work it would roughly be. I though it might be interesting to share to further the discussion. The implementation is neither complete nor probably correct in all cases. As I said, it's to illustrate the idea.

To address your specific points:
Yes there is a branch in the loop. I can probably be factored out. No, I don't have numbers. I do however know that that divisions are some of the most expensive operations a CPU can perform. Much more expensive on a CPU than a branch (which can be predicted by the CPU 99% of the time). Furthermore, this loop runs once per 32 bits stored.

I would like to ask you why you prefer a loop per bit (and also @ianhays)? Do you think
a) the added complexity is not worth the performance benefits.
Or do you
b) not think there is a performance benefit in the first place.
c) something else?

When I tried to do my illustration code I assumed that people's concern was with "a" and that the general consensus was that "b" was true (eg. any implementation using shifts of the underlying ints would beat a loop per bit implementation). Perhaps I was wrong.

Again, I'm sorry I don't offer a full benchmark run and pull request.

I'm sorry you think my idea stinks

I don't think the idea stinks. I was referring to repeated use of "naive". I used the get/set methods provided by the type, they certainly are not most optimal and that can be worked out as part of this ticket or a new one.

offer a pull request

Again, was not pointing at you to open a pull request; I was responding to the fear that someone will merge the proposed patch as is in the BCL code. Usually the PRs go through a tough code review and eventually the good bits land in the official codebase.

not think there is a performance benefit in the first place.

Not sure that's why I think it would be good to have some stats handy to measure/balance out code complexity vs. perf. gain. Plus your signature is different than mine and isWholeBucketShift case might as well be replaced by Array.Clear? What is the rationale behind input int[] bitArray in the signature?

@jasonwilliams200OK - "na茂ve" in this sense is a technical term, something even a veteran programmer may try at first to verify the implementation otherwise works (or in my case, that the tests work). It's not really a reflection on the quality of the programmer.
(yes, it is used to call out the sort of implementation novice programmers may try, or never replace, but that's not how it's being used here)

Regardless, a more performant internal implementation is possible, but that's not really what we're concerned about right now; really, we care about the outputs of the method. When the shifting is done, what does the (visible) bit pattern look like? The na茂ve implementation would zero-fill anything right-shifted (since it doesn't have access to anything else), but one using the underlying array might shift set bits leftover from previous operations (currently @chrisaut 's implementation would do this, but that was only a starting/talking point anyways).

So the question is: after shifting, is it zero-filled, or whatever happens to be leftover from other operations? _Regardless of however the internals are implemented._ Methinks all "new" bits should be zeroed-out/unset, just to remove any ambiguity. Also, it would prevent people relying on the optimization behavior of retaining bits, in case the runtime eventually switches to longs or something.
It also doesn't make sense, as dealing with an abstract group of bits, for them to be set, since there's nowhere for them to "come" from, as it were.

Okay, side note: modifying the BitArray in the System.Collections in the corefx repo will have no effect (see dotnet/runtime#14564 for why). Changes instead (also?) have to go into coreclr.

At least that explains why I was having so much trouble with it earlier... annoying, though.

"na茂ve" in this sense is a technical term, something even a veteran programmer may try at first to verify the implementation otherwise works (or in my case, that the tests work). It's not really a reflection on the quality of the programmer.

Yeah, naive in this usage isn't a bad thing. Often times the naive solution is the cleanest one.

Regardless, a more performant internal implementation is possible, but that's not really what we're concerned about right now; really, we care about the outputs of the method.

:+1: we have to get the API approved before we should be worrying too much about the perf tradeoff of different implementations. It's of course important to consider our options so we can make an informed decision on the behavior of the methods, but without benchmarks and firm data the discussion will devolve.

So the question is: after shifting, is it zero-filled, or whatever happens to be leftover from other operations? Regardless of however the internals are implemented. Methinks all "new" bits should be zeroed-out/unset, just to remove any ambiguity. Also, it would prevent people relying on the optimization behavior of retaining bits, in case the runtime eventually switches to longs or something.
It also doesn't make sense, as dealing with an abstract group of bits, for them to be set, since there's nowhere for them to "come" from, as it were.

Agreed on all points: zero them out. Carrying the sign bit or storing bits from a previous operation are both non-intuitive and base behavior around implementation details or private object state. It would be far more useful to have the shifted bits be consistently known in all cases.

We have two signatures proposed so far:

``` c#
public sealed partial class BitArray : System.Collections.ICollection, System.Collections.IEnumerable
{
...

  • public System.Collections.BitArray RightShift(int count) { return default(System.Collections.BitArray); }
  • public System.Collections.BitArray LeftShift(int count) { return default(System.Collections.BitArray); }
    ...
    }
and:

``` c#
     public sealed partial class BitArray : System.Collections.ICollection, System.Collections.IEnumerable
     {
         ...
 +       public System.Collections.BitArray RightShift(int[] bitArray, int length, int count) { return default(System.Collections.BitArray); }
 +       public System.Collections.BitArray LeftShift(int[] bitArray, int length, int count) { return default(System.Collections.BitArray); }
         ...
     }

On this note, could we add a bool maskBit (or fillBit?) parameter in perhaps overload signatures, so if someone wants 1-mask the operation will fill the gap with true instead of false?

Regarding the retention of state (commutative) -> LeftShift(RightShift(bitArrayObject)) == RightShift(LeftShift(bitArrayObject)), I agree this is probably not desired. Consumer would still be able to cache the state in another instance (new BitArray(bitArrayObject)) before performing the shift operation (not efficient, but unblocks the seemingly obscure use-cases).

Regarding naive dispute, the objection wasn't about mere naive term, it was more about in the context of "if this naive code gets merged into BCL code" ... well it is not going to get into BCL code, not in its current state and at least not without the proper code review etc. etc.

We have two signatures proposed so far:

Minor correction, your second signature should have those methods static. Since none of the other BitArray operations are static, I prefer the first proposed API.

On this note, could we add a bool maskBit (or fillBit?) parameter in perhaps overload signatures, so if someone wants 1-mask the operation will fill the gap with true instead of false?

I kind of doubt that scenario is common enough to justify the overload to be honest. It'd be simple enough to just manually flip the "new" bits after the shift that I'm not certain an API to do it is necessary.

@jasonwilliams200OK yes sorry, I didn't mean naive in a negative way, as others have kindly pointed out.

I don't suggest the signature I have given in my little code snippet, we should def. go with the signature as you have presented.

C# public System.Collections.BitArray RightShift(int count) public System.Collections.BitArray LeftShift(int count)

I've just done it this way because it was the quickest to get an implementation going without actually building the corefx type/repo itself, just as an exploratory tool. It was just a quick explanatory thing. Sorry I should have mentioned this or made the effort to make the signatures match. It seems I caused more confusion, which is the opposite of what I intended.

I also def. agree on the zeroing out part.

I think BitArray should also overload the shift operators:

public static System.Collections.BitArray operator <<(System.Collections.BitArray bitArray, int count);
public static System.Collections.BitArray operator >>(System.Collections.BitArray bitArray, int count);

It'd be a bit odd to have the operator overloads for the shifts but not for the other binary operations i.e. not/or/and.

By the same token it's weird that it doesn't have them at all currently.

Mathematical operation overloads are pretty sparsely used in the framework and I can't think of any instances where they're used for Collections. For example, why doesn't List allow addition through the '+' operator? Or removal through '-'? Those are more common scenarios than the bitwise operations, so surely it would make more sense to have them than it would to have bitwise BitArray operations?

I don't know the reason for this - it may be historical or may be functional. My feeling would be that an operators functionality should always be completely obvious and limited to mathematical operations. So in the above example, adding an item to a list seems obvious but is it a mathematical addition? Not really.

Similarly, a BitArray isn't defined as a number, it is defined as a collection of numbers (bits/bools). Shifting an integer and shifting a BitArray may be functionally similar enough to call them both shifts, but overloading the operator implies a strictness of form to which the BitArray can not adhere - at least under my previous assumptions.

Regardless, the discussion should have a distinct issue of its own e.g. "Consider adding Operator overloads to BitArray".

@ianhays

For example, why doesn't List allow addition through the '+' operator? Or removal through '-'?

I don't think operators make sense for mutable types. If + returned a new List<T>, then that would be inefficient. If it modified the existing List<T>, that would be very confusing.

On the other hand, it is used for immutable types like BigInteger, TimeSpan or string. And since, for better or worse, BitArray was designed as a mutable type, it makes sense to me that it doesn't have operators.

@ianhays I do see your points. Agreed that it should be a separate issue though, will open one

My feeling would be that an operators functionality should always be completely obvious and limited to mathematical operations

If only event listeners adhered to this common sense.

Proposal

The BitArray class in System.Collections should include BitWise right and left shift operations. BitArray is used to represent a collection of on\off values and currently contains And, Or, Xor, and Not operations. Adding RightShift and LeftShift would increase the usefulness of the class and provide a canonical, fast solution to save developers the hassle of coming up with their own.

API

     public sealed partial class BitArray : System.Collections.ICollection, System.Collections.IEnumerable
     {
         ...
+        public System.Collections.BitArray RightShift(int count) { return default(System.Collections.BitArray); }
+        public System.Collections.BitArray LeftShift(int count) { return default(System.Collections.BitArray); }
         ...
     }

Behavior

Binary shifting of a BitArray would be most similar functionally to binary shifting an unsigned integer. New bits resulting from the shift will be set to zero. There is no concept of sign carrying in a BitArray and no preservation of values previously shifted out of the capacity.

Example:

c# BitArray bitArray = new BitArray(5, true); // [1,1,1,1,1] bitArray.RightShift(2); // [0,0,1,1,1] bitArray.LeftShift(2); // [1,1,1,0,0] bitArray.RightShift(3); // [0,0,0,1,1] bitArray.Leftshift(5); // [0,0,0,0,0] bitArray.RightShift(5); // [0,0,0,0,0]

Implementation

@jasonwilliams200OK and @chrisaut have both submitted initial implementations to get the discussion rolling on the implementation details.

@jasonwilliams200OK and @Clockwork-Muse have both submitted some unit tests for the above API.

@terrajobst

Looks good. If you want operators you should file a separate issue as this type currently doesn't have any operators.

    public sealed partial class BitArray : System.Collections.ICollection, System.Collections.IEnumerable
    {
        ...
+       public System.Collections.BitArray RightShift(int count);
+       public System.Collections.BitArray LeftShift(int count);
        ...
    }

Creating my issue, though just to clarify: RightShift and LeftShift return a new BitArray leaving the original unchanged?

@SamuelEnglard - no, all the operations mutate the original.

@Clockwork-Muse so why do they return a BitArray? Chaining?

@SamuelEnglard - Probably.

Side note: there's also a C++ version. Is that something we should be keeping in sync?

Shift all the bit values to left/right on count bits. The current instance is updated and returned.
Exceptions: ArgumentOutOfRangeException if count < 0

Is it correct exception behavior in my PR https://github.com/dotnet/corefx/pull/14240?

Shift all the bit values to left/right on count bits. The current instance is updated and returned.
Exceptions: ArgumentOutOfRangeException if count < 0

Yes, that is the correct behavior.

Was this page helpful?
0 / 5 - 0 ratings