So it seems like designing an API for a "list" type that mirrors the Python "list" type shouldn't be too hard. My main concern is actually what the consensus is regarding what @mppf said in another thread.
does it move elements on growing (I want the answer to be no, for parallel safety reasons, as we discussed)
Another question is - does it have the capability to move elements at all?
If it can't move elements at all, we can't have:
- pop(i)
- remove(i)
- insert(i, elt)
I think it'd be reasonable to go either way on this. I think it's really important append does not invalidate references to the elements, but we could live with having such a reference and then doing a remove earlier being a user error IMO.
@benharsh, @bradcray, @mppf, How do you all stand regarding:
I think the rest of the API design has pretty much fallen into place.
I could live with a list type that moves elements, but think that a list based on an approach like Michael's/Vass's linked-list-of-blocks-of-doubling-size approach would be efficient while also avoiding moving elements (the other downside of doing so apart from invalidating references being cost of copying and potentially memory overhead while sloshing from the old copy to the new).
I don't know whether there is a downside to such an approach.
I don't know whether there is a downside to such an approach.
The downside that I know of is that there is an extra memory indirection when doing random access into the list. However I think that choosing to enable more parallelism (with append not invalidating references) is a worthwhile tradeoff for us.
My answers to @dlongnecke-cray's questions:
- Can a vector move elements on growing? Why is this bad (beyond invalidating references to elements in the container)?
I don't want it to. In my mind, this is really about issue #8339 and related patterns. I think that we should make an effort to remove the possibility for super subtle bugs like those that arise from a reference to an element being invalidated while a list/set/map is growing. These problems are particularly subtle and challenging to fix in the context of parallel code (like the forall loop in that issue) and I think addressing them is part of making a good parallel language.
- What do you gain from not being able to move elements at all? What do you lose?
One part is that references won't be invalidated - one task can modify the data structure while another task reads it. I'm not proposing that we keep elements alive if they are deleted in one task while another task is using them, so I think there will always be an issue. (You could argue that removing elements and then adding them back in will always "move" the elements).
A second part is that the operations that move elements will require implementation effort and be relatively slow by nature (i.e. O(n)). I think we're better off without them if the data type still addresses most use cases.
Note that if you really want to push/pop from both ends, having a different deque data type (that has a different interface and can be implemented with 2 lists) is probably a better idea. That way, the list can have better API, speed, and memory overhead properties for common usage.
- What might a potential implementation look like given these restrictions? (Something along the lines of "we'd use a linked list, "we'd use an array", "we'd use a list of arrays").
It'd look more or less like the VectorList that @vasslitvinov added in #12791. Note I mentioned on that PR some academic work that both allows O(1) indexing and keeps the space overhead to O(sqrt(n))
The data structure is a collection of blocks. Each block contains elements. The elements are not moved (so references to an element can only be invalidated if that element is removed from the collection / the entire collection is deleted).
@vasslitvinov's VectorList uses a linked list of blocks. I would consider using an array of block pointers, personally. (Yes that can still be parallel safe; you'd protect modifications to the array of block pointers with a lock, and references to elements in that will never leak.). However this choice seems to me to have more performance impact than it has API impact. (Especially in parallel, trading off append speed vs lookup speed).
I think it's worth trying to get indexing to be O(1), which it isn't in the VectorList now. One way to do that is with an array of block pointers, but that's not necessarily the only way. I'd view it as a show-stopper if indexing is O(n) but O(log n) might be acceptable. However I don't think this choice has much API design impact on its own. (however I think the API design should allow implementations with O(1) indexing).
Thanks for your detailed response! I appreciate it, lots of reading to do now.
Minor notes.
Hi Bryant —
Another list variant of interest is one that can find() or remove() from the middle of the list.
Are you envisioning a list that removes from the middle in O(1) rather than O(n) time? (and using a strategy other than a linked list?) Do you have a specific model in mind? (e.g., lazy deletion?)
@mppf -- Note that if you really want to push/pop from both ends, having a different deque data type (that has a different interface and can be implemented with 2 lists) is probably a better idea. That way, the list can have better API, speed, and memory overhead properties for common usage.
I agree with this statement.
For performance reasons, the main data structures I would primarily use would be akin to Python list or C++ vector with O(1) indexing, backed by an array or maybe a linked list or tree of arrays.
Some variants I would want:
@BryantLam: Something that's being debated a bit here: Do you have an opinion as to whether Lists should have value or reference semantics? (i.e., should they be exposed to the user as a record or a class?)
Another thing to think about in addition: is subclassing a container (IE, list) a common paradigm? Forwarding is presented as an alternative, but it really comes down to the most common expected use case.
Another thing to think about in addition: is subclassing a container (IE, list) a common paradigm?
I don't think that's important for us. In my mind, the plan is to add interfaces/concepts/constrained generics. At that point, subclassing will be an available implementation technique but not necessary to allow code to work with different list-like things. Generic programming allows code to work with different list-like things now, but the compiler doesn't help check that only the functions/methods provided by a "list-like thing interface" are used (vs. functions/methods specific to one List variant, say). Additionally, (eventually), the interfaces/concepts/constrained generics effort can support run-time polymorphism for implementations of an interface.
Something that's being debated a bit here: Do you have an opinion as to whether Lists should have value or reference semantics? (i.e., should they be exposed to the user as a record or a class?)
I like consistency. Whatever arrays do, do the same thing. If that behavior isn't desirable, then maybe arrays have the wrong behavior. Ideally, the [compiler] optimizations available to arrays should be applicable to these lists too.
Regarding default intents,
Another thing to think about in addition: is subclassing a container (IE, list) a common paradigm?
I'm not a great person to ask because I hate subclassing unless the hierarchy is naively apparent. I'd rather do composition. That said, I'd say no; there's too many pitfalls with allowing users to subclass a container. One such issue being that you then expose the internals of that container thereby breaking encapsulation. The alternative approach is with something like a concept, trait, or constrained generic.
Resolved by discussion and review in https://github.com/chapel-lang/chapel/pull/13030. Implementation coming soon!
Design question. Why is List.append not named as List.push so as to be the opposite of List.pop? It boggles my mind that append is the opposite of pop.
@BryantLam - we based the naming and interface on Python's list which has append and pop - see https://docs.python.org/3/tutorial/datastructures.html#more-on-lists
Well that's annoying, but okay. :+1:
Most helpful comment
I like consistency. Whatever arrays do, do the same thing. If that behavior isn't desirable, then maybe arrays have the wrong behavior. Ideally, the [compiler] optimizations available to arrays should be applicable to these lists too.
Regarding default intents,
I'm not a great person to ask because I hate subclassing unless the hierarchy is naively apparent. I'd rather do composition. That said, I'd say no; there's too many pitfalls with allowing users to subclass a container. One such issue being that you then expose the internals of that container thereby breaking encapsulation. The alternative approach is with something like a concept, trait, or constrained generic.