Chapel: deque/list design: API

Created on 28 Feb 2019 · 16Comments · Source: chapel-lang/chapel

We will be replacing the current list/deque functionality on arrays with distinct types. This issue concerns the design of the API for those types.

Python:

list
deque

Rust:

vec
VecDeque

Libraries / Modules Design

Source

benharsh

Most helpful comment

Something that's being debated a bit here: Do you have an opinion as to whether Lists should have value or reference semantics? (i.e., should they be exposed to the user as a record or a class?)

I like consistency. Whatever arrays do, do the same thing. If that behavior isn't desirable, then maybe arrays have the wrong behavior. Ideally, the [compiler] optimizations available to arrays should be applicable to these lists too.

Regarding default intents,

How would a user mark a record or this/these list variants with intent of "behave the same as arrays" at a boundary call site, etc.?
There might be a problem with arrays behaving with default intent of "magic ref" where it is ref or const ref depending on blah. These lists feel like they should have the same behavior, but from a design point of view, maybe arrays should actually just be ref instead of magic ref.

Another thing to think about in addition: is subclassing a container (IE, list) a common paradigm?

I'm not a great person to ask because I hate subclassing unless the hierarchy is naively apparent. I'd rather do composition. That said, I'd say no; there's too many pitfalls with allowing users to subclass a container. One such issue being that you then expose the internals of that container thereby breaking encapsulation. The alternative approach is with something like a concept, trait, or constrained generic.

BryantLam on 18 May 2019

👍2

All 16 comments

So it seems like designing an API for a "list" type that mirrors the Python "list" type shouldn't be too hard. My main concern is actually what the consensus is regarding what @mppf said in another thread.

does it move elements on growing (I want the answer to be no, for parallel safety reasons, as we discussed)

Another question is - does it have the capability to move elements at all?

If it can't move elements at all, we can't have:

pop(i)

remove(i)

insert(i, elt)

I think it'd be reasonable to go either way on this. I think it's really important append does not invalidate references to the elements, but we could live with having such a reference and then doing a remove earlier being a user error IMO.

@benharsh, @bradcray, @mppf, How do you all stand regarding:

Can a vector move elements on growing? Why is this bad (beyond invalidating references to elements in the container)?
What do you gain from not being able to move elements at all? What do you lose?
What might a potential implementation look like given these restrictions? (Something along the lines of "we'd use a linked list, "we'd use an array", "we'd use a list of arrays").

I think the rest of the API design has pretty much fallen into place.

dlongnecke-cray on 3 May 2019

I could live with a list type that moves elements, but think that a list based on an approach like Michael's/Vass's linked-list-of-blocks-of-doubling-size approach would be efficient while also avoiding moving elements (the other downside of doing so apart from invalidating references being cost of copying and potentially memory overhead while sloshing from the old copy to the new).

I don't know whether there is a downside to such an approach.

bradcray on 3 May 2019

I don't know whether there is a downside to such an approach.

The downside that I know of is that there is an extra memory indirection when doing random access into the list. However I think that choosing to enable more parallelism (with append not invalidating references) is a worthwhile tradeoff for us.

mppf on 3 May 2019

My answers to @dlongnecke-cray's questions:

Can a vector move elements on growing? Why is this bad (beyond invalidating references to elements in the container)?

I don't want it to. In my mind, this is really about issue #8339 and related patterns. I think that we should make an effort to remove the possibility for super subtle bugs like those that arise from a reference to an element being invalidated while a list/set/map is growing. These problems are particularly subtle and challenging to fix in the context of parallel code (like the forall loop in that issue) and I think addressing them is part of making a good parallel language.

What do you gain from not being able to move elements at all? What do you lose?

One part is that references won't be invalidated - one task can modify the data structure while another task reads it. I'm not proposing that we keep elements alive if they are deleted in one task while another task is using them, so I think there will always be an issue. (You could argue that removing elements and then adding them back in will always "move" the elements).

A second part is that the operations that move elements will require implementation effort and be relatively slow by nature (i.e. O(n)). I think we're better off without them if the data type still addresses most use cases.

Note that if you really want to push/pop from both ends, having a different deque data type (that has a different interface and can be implemented with 2 lists) is probably a better idea. That way, the list can have better API, speed, and memory overhead properties for common usage.

What might a potential implementation look like given these restrictions? (Something along the lines of "we'd use a linked list, "we'd use an array", "we'd use a list of arrays").

It'd look more or less like the VectorList that @vasslitvinov added in #12791. Note I mentioned on that PR some academic work that both allows O(1) indexing and keeps the space overhead to O(sqrt(n))

The data structure is a collection of blocks. Each block contains elements. The elements are not moved (so references to an element can only be invalidated if that element is removed from the collection / the entire collection is deleted).

@vasslitvinov's VectorList uses a linked list of blocks. I would consider using an array of block pointers, personally. (Yes that can still be parallel safe; you'd protect modifications to the array of block pointers with a lock, and references to elements in that will never leak.). However this choice seems to me to have more performance impact than it has API impact. (Especially in parallel, trading off append speed vs lookup speed).

I think it's worth trying to get indexing to be O(1), which it isn't in the VectorList now. One way to do that is with an array of block pointers, but that's not necessarily the only way. I'd view it as a show-stopper if indexing is O(n) but O(log n) might be acceptable. However I don't think this choice has much API design impact on its own. (however I think the API design should allow implementations with O(1) indexing).

mppf on 3 May 2019

👍1

Thanks for your detailed response! I appreciate it, lots of reading to do now.

dlongnecke-cray on 3 May 2019

Minor notes.

Parallel safety is highly desirable. This obviously leads us to potentially providing multiple variants for lists, etc. that trade off flexibility and performance.
Another list variant of interest is one that can find() or remove() from the middle of the list.

BryantLam on 9 May 2019

Hi Bryant —

Another list variant of interest is one that can find() or remove() from the middle of the list.

Are you envisioning a list that removes from the middle in O(1) rather than O(n) time? (and using a strategy other than a linked list?) Do you have a specific model in mind? (e.g., lazy deletion?)

bradcray on 9 May 2019

@mppf -- Note that if you really want to push/pop from both ends, having a different deque data type (that has a different interface and can be implemented with 2 lists) is probably a better idea. That way, the list can have better API, speed, and memory overhead properties for common usage.

I agree with this statement.

For performance reasons, the main data structures I would primarily use would be akin to Python list or C++ vector with O(1) indexing, backed by an array or maybe a linked list or tree of arrays.

Some variants I would want:

variant like what @mppf is proposing where you can index in roughly O(1) time but you can only push/pop the back.
traditional deque with ability to push/pop the front and back.
variant that allows items to be removed from the middle (find/remove support), probably in O(n) time.
possibly more variants of the above in thread-safe and non-thread-safe forms for performance reasons, but that's not an immediate concern since it is optimizing something we don't have yet.

BryantLam on 16 May 2019

@BryantLam: Something that's being debated a bit here: Do you have an opinion as to whether Lists should have value or reference semantics? (i.e., should they be exposed to the user as a record or a class?)

bradcray on 17 May 2019

Another thing to think about in addition: is subclassing a container (IE, list) a common paradigm? Forwarding is presented as an alternative, but it really comes down to the most common expected use case.

dlongnecke-cray on 17 May 2019

Another thing to think about in addition: is subclassing a container (IE, list) a common paradigm?

I don't think that's important for us. In my mind, the plan is to add interfaces/concepts/constrained generics. At that point, subclassing will be an available implementation technique but not necessary to allow code to work with different list-like things. Generic programming allows code to work with different list-like things now, but the compiler doesn't help check that only the functions/methods provided by a "list-like thing interface" are used (vs. functions/methods specific to one List variant, say). Additionally, (eventually), the interfaces/concepts/constrained generics effort can support run-time polymorphism for implementations of an interface.

mppf on 17 May 2019

👍2

Something that's being debated a bit here: Do you have an opinion as to whether Lists should have value or reference semantics? (i.e., should they be exposed to the user as a record or a class?)

Regarding default intents,

How would a user mark a record or this/these list variants with intent of "behave the same as arrays" at a boundary call site, etc.?
There might be a problem with arrays behaving with default intent of "magic ref" where it is ref or const ref depending on blah. These lists feel like they should have the same behavior, but from a design point of view, maybe arrays should actually just be ref instead of magic ref.

Another thing to think about in addition: is subclassing a container (IE, list) a common paradigm?

BryantLam on 18 May 2019

👍2

Resolved by discussion and review in https://github.com/chapel-lang/chapel/pull/13030. Implementation coming soon!

dlongnecke-cray on 21 May 2019

Design question. Why is List.append not named as List.push so as to be the opposite of List.pop? It boggles my mind that append is the opposite of pop.

BryantLam on 11 Nov 2019

@BryantLam - we based the naming and interface on Python's list which has append and pop - see https://docs.python.org/3/tutorial/datastructures.html#more-on-lists

mppf on 11 Nov 2019

Well that's annoying, but okay. :+1:

BryantLam on 12 Nov 2019

Was this page helpful?

0 / 5 - 0 ratings

Related issues

Overloading assignment (=) on class types

bradcray · 4Comments

parallel block scan results are incorrect

BryantLam · 3Comments

Internal compiler error in user-defined follower iterator with `in` intent

npadmana · 3Comments

Should "new initializers" support field initialization prior to super.init()?

bradcray · 3Comments

CG: when to resolve an interface declaration?

vasslitvinov · 3Comments