update for 2019-10-07: We have a StringDtype extension dtype. Its memory model is the same as the old implementation, an object-dtype ndarray of strings. The next step is to store & process it natively.
xref #8627
xref #8643, #8350
Since we introduced Categorical in 0.15.0, I think we have found 2 main uses:
1) as a 'real' Categorical/Factor type to represent a limited subset of values that the column can take on
2) as a memory-saving representation for object dtypes.
I could see introducing a dtype='string', where String is a slightly specialized sub-class of Categorical, with 2 differences compared to a 'regular' Categorical:
Categorical will complain if you do this:
In [1]: df = DataFrame({'A' : Series(list('abc'),dtype='category')})
In [2]: df2 = DataFrame({'A' : Series(list('abd'),dtype='category')})
In [3]: pd.concat([df,df2])
ValueError: incompatible levels in categorical block merge
Note that this works if they are Series (and prob should raise as well, side-issue).
But, if these were both 'string' dtypes, then it's a simple matter to combine (efficiently).
Only allow string/unicode values (iow, don't allow numbers / arbitrary objects); this makes the constructor a bit simpler, but more importantly, you now have a 'real' non-object string dtype.
I don't think this would be that complicated to do. The big change here would be to essentially convert any object dtypes that are strings to dtype='string', e.g. on reading/conversion/etc. This might be a perf issue for some things, but I think the memory savings greatly outweigh the cost.
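For a rough sense of those savings (a sketch only; the exact numbers depend on the data and the pandas version), compare an object-dtype column of repeated strings with its dictionary-encoded equivalent:

import pandas as pd

# same low-cardinality string column, stored two ways
s_obj = pd.Series(["apple", "banana", "cherry"] * 100_000)   # object dtype: one PyObject reference per row
s_cat = s_obj.astype("category")                             # int8 codes + 3 categories

print(s_obj.memory_usage(deep=True))   # counts every Python string object
print(s_cat.memory_usage(deep=True))   # typically an order of magnitude smaller for data like this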
We would then have a 'real' looking object dtype (and object would be relegated to actual python object types, so would be used much less).
cc @shoyer
cc @JanSchulz
cc @jorisvandenbossche
cc @mwiebe
thoughts?
I think it would be a very nice improvement to have a real 'string' dtype in pandas.
So no longer having the confusion in pandas of object dtype being actually in most cases a string, and sometimes a 'real' object.
However, I don't know if this should be 'coupled' to categorical. Maybe that is only a technical implementation detail, but for me it should just be a string dtype, a dtype that holds string values, and has in essence nothing to do with categorical.
If I think about a string dtype, I am more thinking about numpy's string types (but those of course also have impracticalities, e.g. the fixed sizes), or the CHAR/VARCHAR in sql.
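For reference, a quick sketch of that fixed-size impracticality with numpy's unicode dtype (every element is padded to the widest value, and longer assignments are silently truncated):

import numpy as np

arr = np.array(["a", "abc"])    # dtype '<U3': 3 characters reserved per element
arr[0] = "abcdef"               # silently truncated on assignment
print(arr)                      # ['abc' 'abc']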
I'm of two minds about this. This could be quite useful, but on the other hand, it would be _way_ better if this could be done upstream in numpy or dynd. Pandas specific array types are not great for compatibility with the broader ecosystem.
I understand there are good reasons it may not be feasible to implement this upstream (#8350), but these solutions do feel very stop-gap. For example, if @teoliphant is right that dynd could be hooked up in the near future to replace numpy in pandas internals, I would be much more excited about exploring that possibility.
As for this specific proposal:
I would call it "string" rather than "interned_string", unless we're sure interning is always a good idea. Also, libraries like dynd _do_ implement a true variable length string type (unlike numpy), and I think it is a good long term goal to align pandas dtypes with dtypes on the ndarray used for storage.
So I have tagged a related issue, about including integer NA support by using libdynd (#8643). This will actually be the first thing I do (as it's new and cool, and I think a slightly more straightforward path to include dynd as an optional dep).
@mwiebe can you maybe explain a bit about the tradeoffs involved with representing strings in 2 ways using libdynd?
cc @teoliphant
I've had in mind an intention to tweak the string representation in dynd slightly, and have written that up now. https://github.com/ContinuumIO/libdynd/issues/158 The vlen string in dynd does work presently, but it has slightly different properties than what I'm writing here.
Properties that this vlen string has are a 16 byte representation, using the small string optimization. This means strings with size <= 15 bytes encoded as utf-8 will fit in that memory. Bigger strings will involve a dynamic memory allocation per string, a little bit like Python's string, but with the utf-8 encoding and knowledge that it is a string instead of having to go through dynamic dispatch like in numpy object arrays of strings.
Representing strings as a dynd categorical is a bit more complicated, and wouldn't be dynamically updatable in the same way. The types in dynd are immutable, so a categorical type, once created, has a fixed memory layout, etc. This allows for optimized storage, e.g. if the total number of categories is <= 256, each element can be stored as one byte in the array, but it does not allow the assignment of a new string that was not already in the array of categories.
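pandas' own Categorical already illustrates this dictionary-encoded layout (a small array of unique categories plus an integer code per element; with few categories the codes fit in one byte each):

import pandas as pd

c = pd.Categorical(["foo", "bar", "foo", "baz"])
print(c.categories)   # Index(['bar', 'baz', 'foo'], dtype='object')
print(c.codes)        # [2 0 2 1], dtype int8 -- one byte per element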
The issue mentioned in the last comment is now at https://github.com/libdynd/libdynd/issues/158
Any opinions on working on this for 0.19? Hopefully I have some time during the summer :)
There are a few comments in #13827, and I think it's OK if it can be done without breaking existing users' code. We may need some breaking changes in 2.0, but the same limitation should be applied to Categorical...
I want to release 0.19.0 shortly (RC in a couple of weeks). So let's slate this for the next major release (which will be 1.0, rather than 0.20.0) I think.
yep, but let me try this weekend. of course it's ok to put it off to 1.0 if there is no time to review:)
@sinhrks hey I think a real-string pandas dtype would be great. would allow us to be much more strict about object dtype.
How much work / additional code complexity would this require? I see this as a "nice to have" rather than something that adds fundamentally new functionality to the library
maybe @sinhrks can comment more here, but I think at the very least this allows for quite some code simplification. We will then know w/o having to constantly infer whether something is all strings or includes actual objects.
I think it could be done w/o changing much top-level API (e.g. adding another pandas dtype), we have most of this machinery already done.
My concern is that it may introduce new user APIs / semantics which may be in the line of fire for future API breakage. If the immediate _user_ benefits (vs. developer benefits) warrant this risk then it may be worth it
I worked a little on this, and currently expect minimal API change, because it behaves like a Categorical which internally handles categories and codes automatically (users don't need to care about its internal repr).
I assume the impl consists of 2 parts, mostly done by re-using / cleaning-up the current code:
1) a String class which wraps .str methods (this should simplify string.py). Maybe replaced by a StringArray(?) or its wrapper in the future.
2) a string dtype (shares most of its internals with Categorical).
I agree that we shouldn't force users/devs to pay an unnecessary migration cost. I expect it can be achieved by minimizing Categorical API breakage (which should also be applied to String).
This is the first instance in a long time of changing the logical dtype under users' feet. The last (I think?) was the creation of DatetimeIndex and adding datetime64[ns] to the set of supported dtypes. I'm aware of pandas users that are still running on a fork of 0.7.x over this, if you can believe it.
So, this proposed change introduces a couple of immediate API breakages:
1) string_arr.dtype == np.object_ is now False
2) string_arr.values is no longer an ndarray (is that right?)
This alone makes this seem not really comparable to Categorical / DatetimeTZ (those were new types, not modifying existing types).
I don't think .values should be affected; it will still return an object array. This is handled with an indirection from the BlockManager (e.g. .external_values() / .internal_values()).
we did this for datetime-tz:
In [3]: s = Series(pd.date_range('20130101',periods=3,tz='US/Eastern'))
In [4]: s
Out[4]:
0 2013-01-01 00:00:00-05:00
1 2013-01-02 00:00:00-05:00
2 2013-01-03 00:00:00-05:00
dtype: datetime64[ns, US/Eastern]
In [5]: s.values
Out[5]:
array(['2013-01-01T05:00:00.000000000', '2013-01-02T05:00:00.000000000',
'2013-01-03T05:00:00.000000000'], dtype='datetime64[ns]')
In [6]: s._values
Out[6]: DatetimeIndex(['2013-01-01 00:00:00-05:00', '2013-01-02 00:00:00-05:00', '2013-01-03 00:00:00-05:00'], dtype='datetime64[ns, US/Eastern]', freq='D')
unfort the numpy-dtype vs string comparison is horribly broken IMHO by numpy: https://github.com/numpy/numpy/issues/5329
but to be honest it's _already_ broken by category/datetimetz, and ONLY for == on Series. Of course we have the accessors and .select_dtypes as the recommended ways.
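For example, the dtype-comparison-free way to pick out columns by type (a quick sketch):

import pandas as pd

df = pd.DataFrame({"a": ["x", "y"], "b": [1, 2], "c": pd.Categorical(["x", "y"])})
print(df.select_dtypes(include=["object"]).columns.tolist())     # ['a']
print(df.select_dtypes(include=["category"]).columns.tolist())   # ['c']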
Given all this evidence, I don't think this will impose on users any more than we already have, and will make some code quite a bit simpler as @sinhrks indicates.
unfort the numpy-dtype vs string comparison is horribly broken IMHO by numpy: numpy/numpy#5329
Sure, but that doesn't make it any less real for our users. Hopefully they use np.asarray to convert to NumPy arrays, but quite possibly they have logic that checks dtypes and expects strings to use dtype=object.
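For illustration, hypothetical downstream code of the kind described above (the branch stops being taken once all-string columns get their own dtype):

import numpy as np
import pandas as pd

s = pd.Series(["a", "b", "c"])     # today: dtype == object
if s.dtype == np.object_:          # a new string dtype would make this False
    values = np.asarray(s)         # user code assuming an object ndarray of str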
Without fixing this upstream, this sort of breakage is best handled by changing things all together in a pandas 2.0 release.
string_arr.dtype == np.object_ is now False
Yes, but worse: np.dtype(object) == string_arr.dtype will now raise TypeError: data type not understood.
@shoyer Yes this is broken upstream, but I don't see it ever being fixed. There has been very little movement in numpy for fundamental things like this. This is in fact one of the reasons that pandas 2.0 needs to happen.
That said, I also do not see any good reason to hold back on changes which bring a better user experience.
Yes this is broken upstream but I don't see it ever fixed. Has been very little movement in numpy for fundamental things like this.
Things get fixed in NumPy when someone who cares (generally a downstream developer) gets involved and makes it happen, taking the time to get buy in from stakeholders on the mailing list.
That's how I was able to fix datetime64
(in NumPy 1.11) from "implicitly UTC" (with automatic conversion to local time zones when printed) to datetime native.
That said, I also do not see any good reason to hold back on changes which bring a better user experience.
Many parts of pd.String would certainly be a better user experience. But just as certainly, the transition would cause pain for some users due to the API breakage. This is not an unambiguous win, especially if we are going to overhaul things again with pandas 2.0.
As another example, even if .values still works by returning a new numpy array, there's no way to avoid breaking assignment to that array.
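To make that concrete, a sketch of today's behaviour with an object-dtype column:

import pandas as pd

s = pd.Series(["a", "b", "c"])   # object dtype: .values hands back the actual backing ndarray
s.values[0] = "z"
print(s[0])                      # 'z' -- the write is visible in the Series

With a non-ndarray-backed string type, .values would have to materialize a fresh object array, so the same write would be silently lost.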
Things get fixed in NumPy when someone who cares (generally a downstream developer) gets involved and makes it happen, taking the time to get buy in from stakeholders on the mailing list.
That's how I was able to fix datetime64 (in NumPy 1.11) from "implicitly UTC" (with automatic conversion to local time zones when printed) to datetime native.
and that's great. But generally I think the downstream devs have way _less_ time as they have to contend with their own packages.
I have seen the frustration first hand, the glacial pace and endless discussion on the numpy mailing list. Surely they are trying to preserve backward compat and that is great. But this stifles things.
Holding back pandas with this standard in this way just makes people turn away out of frustration. Better performance, better compat, and features have been driving pandas for quite a while. Why stop now?
We _already_ provide much compat with numpy, but that does not mean that this should guide pandas direction _solely_ EVEN in the current pandas versions. 2.0 is going to take a while.
pandas forging ahead is a GREAT thing for the community. We go to great lengths to provide compatibility. sure pd.String is not an unambiguous win, but I don't think _any_ changes are nowadays. There _always_ is a compat issue/argument.
As with all things, there are tradeoffs. Let's try to explicitly list out the concrete pros / user benefits (and code examples showing before/after if relevant) and also cons (API breaks, any changes to memory representation, etc.).
The benefits of having a separate dtype from object are several fold:
pd.String (the class), sub-classing (or maybe a super-class) of pd.Categorical, provides quite a number of memory and performance benefits.
Cons:
.values can coerce to an object array for compat. Assignment via .values will not work. Sure, but in the general case with a 2-d Frame, this generally doesn't work now. Certainly there are times it _can_ work. Further, this case _has_ to go away. We have a set of indexers that already do all of this in a very clear way; providing multiple ways of doing an action is not very pythonic. This is a minor usecase and can be easily documented.
I would propose: string[encoding], with the encoding being optional (e.g. string is acceptable as a dtype).
Some would say that we should just wait for pandas 2.0. However a) this can lay the groundwork for the API change (in the dtype), and b) this may not be all that crazy to do; we have all of the machinery already existing.
related #13941 for Period[freq], and Boolean types (much simpler to implement).
Just to chime in from my (limited) experience from helping with pd.Categorical: that needed one release to introduce the new functionality and one additional major release to work out all the corner cases. While lots of corner cases regarding "encode objects/strings with int + lookup" are now guarded with is_categorical_dtype (and so can be looked at and decided if they guard against "encode only" or a special case of "this is different for categoricals"), I still suspect that a second release will be needed to iron out the corner cases. So IMO implementing it in the release which should become a long term release is quite a risk.
Several thoughts:
I'm wary of adding pd.StringArray (a subclass of pd.Categorical) at this late stage. Other extension dtypes have added new semantic functionality whereas this modifies existing functionality. The strongest argument I see is the more efficient / performant internal representation, but this is quite a rabbit hole. For example:
In [1]: import pandas as pd
In [2]: cats = pd.Categorical.from_array(['foo', 'bar', 'baz'])
In [3]: cats
Out[3]:
[foo, bar, baz]
Categories (3, object): [bar, baz, foo]
In [4]: cats[2] = 'qux'
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-4-4e3b1fefa183> in <module>()
----> 1 cats[2] = 'qux'
/home/wesm/miniconda/lib/python3.5/site-packages/pandas/core/categorical.py in __setitem__(self, key, value)
1609 # something to np.nan
1610 if len(to_add) and not isnull(to_add).all():
-> 1611 raise ValueError("Cannot setitem on a Categorical with a new "
1612 "category, set the categories first")
1613
ValueError: Cannot setitem on a Categorical with a new category, set the categories first
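(For reference, the current escape hatch is to declare the category up front; roughly:)

cats = cats.add_categories(['qux'])   # returns a new Categorical with 'qux' registered
cats[2] = 'qux'                       # now allowed, since 'qux' is a known category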
If we do a dictionary-encoded representation in pandas 2.0, then we will have to deal with this as well, and this implementation work will likely be duplicated. We also don't necessarily know what kinds of performance regressions this will introduce into user code that does a lot of string array mutation.
In [7]: df = pd.DataFrame({'periods': pd.period_range('2000-01-01', periods=10)})
In [8]: df.dtypes
Out[8]:
periods object
dtype: object
In [9]: df['periods'][0]
Out[9]: Period('2000-01-01', 'D')
but the number of users who are depending on this current behavior / representation seems limited compared with strings (which are used by effectively every pandas user).
I would rather be conservative here (given that strings along with floating point numbers are probably the two most important types of data right now used by pandas users) and invest our energies designing a more future-proof foundation in the 2.x development branch (and where we will have enough time to eat the dog food and fix any mistakes before users are impacted).
@jreback Thanks for outlining pros/cons. I think it's important to consider that most of the advantages here will be short lived / obsoleted by pandas 2.0. In contrast, the downsides of yet-another data-type migration are quite real (especially if we need to do more fix-ups later). When we make this change, it is quite likely to break many downstream applications and libraries.
What would valid choices for encoding be? Just ascii and utf8, or the full range of valid Python encodings? From a forward thinking perspective, I would suggest dropping encoding and requiring that all strings be unicode/UTF-8 (like Python 3).
So when I originally wrote this I was actually going to separate this into 2 stages: a creation of the string dtype, and separately the change in the underlying repr (to a pd.String that was categorical based).
My goal was multi-fold.
The dtype change is really for code-cleanup internally. This in-and-of-itself prob does not justify the cost of its changes, but it moves us in the direction of pandas 2.0. I actually think this is a very very important point. Just saying pandas 2.0 seems like it is _around-the-corner_. But we all know that it is at the very least 1 year away from a stable back-compat release.
I don't see why pandas 1.x should slow down / stop, EVEN IF we have a further API change. Here's the crucial point. I think any attempt to make a 'BIG' leap (aka py2/py3) is just a complete disaster and should be avoided at all costs. Including, and up to, multiple 'smaller' API breaks.
This gives people time to adjust gradually. The more gradual the better. Since pandas 2.0 will be a user API change in _maybe_ 1 year, having one in 3 months which will do the bulk of the changes anyhow is, IMHO, beneficial, NOT detrimental.
As far as the details, as @shoyer points out, ideally we could spec this out to be 'about' what pandas 2.0 needs. So I would support string[ascii] for compat and string[utf-8], where string == string[utf-8]. Again these would prob just be a 'display' dtype, e.g.
In [1]: Series(list('abc'))
Out[1]:
0 a
1 b
2 c
dtype: string
In [2]: Series([u'a', u'b', u'c'])
Out[2]:
0 a
1 b
2 c
dtype: string[utf8]
Even if the impl is actually a rabbit hole (and to be honest it will STILL need to be addressed, but of course that could be later), this is such a big win for memory usage that I would push for this in 1.x. That is why I am pushing for the dtype change, with an attendant breakage.
In fact, better to do it now, to see how it shakes out in reality. What better test bed than current pandas?
I don't see why pandas 1.x should slow down / stop, EVEN IF we have a further API change
I have a hard time believing that a major refactor of pandas's internals can ever succeed if pandas 1.x does not commit to strict API stability. If you are against creating a production / fully API-stable maintenance branch, perhaps we should return to that discussion on the mailing list.
To me this feels like one of the most sensitive changes that the library has seen in a long time, on par with the datetime64 work from pandas 0.7 to pandas 0.8, and so I'm not confident that we can get it right on the first try. My gut feeling is that it will affect users in many unknown ways in the long tail, and we won't get that feedback until releasing the change in a major release, which will foil the plan of making a API-stable major release.
As far as the details, as @shoyer points out. Ideally we could spec this out to be 'about' what pandas 2.0 needs.
I'm also very concerned about ending up with a bolted-on solution (that we feel some obligation to stick with) before we have a chance to really dig in and see what uniform, self-contained metadata / logical types look like practically-speaking for users (which will likely take some iteration, so why add constraints now?).
I would personally also go the more conservative route for pandas 1.0, for the following reasons:
If we would like to add this functionality now, I would rather go the 'opt-in' route (if this is possible), and not using it by default for string columns. We _could_ rather easily make a version of the Categorical without the strict checks on the categories, and provide this for users _now_ that want a way to have a more performant string type, without the strictness of current Categorical.
But of course, then we don't have the advantage of simpler code paths internally that @jreback listed above.
ok @wesm and @jorisvandenbossche you make some valid points. So will move this to the 2.0 milestone. All that said, if during discussions it looks like pandas 2.0 will be significantly delayed, e.g. more than 1 year out, then we ought to reconsider non-trivial API changes.
We should be really clear what 'freezing' the API actually means.
We should be really clear what 'freezing' the API actually means.
Yes, let's start a separate discussion for this. Start a discussion on the pandas-dev list first? (or an issue is fine for me as well). I will try to formulate some initial thoughts this evening.
Yes let's take that discussion to the mailing list.
@cpcloud and I were talking about #19520.
This would allow this to proceed. IOW having an external library (pyarrow) to manage the memory of an array of strings. We could then defer all of the ops to the array extension.
would save massively on memory and be quite performant.
cc @TomAugspurger @wesm @jorisvandenbossche
I think that's worth exploring (more generally, I think the extension array stuff offers a decent way of trialing pyarrow-backed things in the pandas-1 codebase).
Would this be transparent to users, or would it be a third-party library, and they'd be required to somehow create an array-backed array, which would then be stored in pandas?
I think it'd be interesting to explore -- since we don't yet have a native operator library for Arrow string arrays, it seems of limited usefulness for any analytics, and probably won't make things any faster right now (since copies of the strings as PyBytes/PyUnicode would have to be materialized to do any computations).
I'm going to have a shot at this using ExtensionArrays, pyarrow and numba, mainly to see what the combination of the three makes possible, so only expect a prototype. Looking at some simple operations like startswith, one can already see a bit better performance than with the current object arrays for strings.
One of the limitations of Arrow, and especially the memory layout of its string array container, is that you will not be able to do in-place operations. I guess for a lot of operations this should not be problematic, but it will lead to copies in cases where a user does df.loc[25:30, 'str_col'] = "some string".
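A rough sketch of the comparison involved (this uses the string kernels that later landed in pyarrow's compute module, not what the fletcher prototype used at the time):

import pandas as pd
import pyarrow as pa
import pyarrow.compute as pc   # requires a reasonably recent pyarrow

s = pd.Series(["apple", "banana", "apricot"] * 100_000)

mask_obj = s.str.startswith("ap")                # object path: iterates over Python string objects

arr = pa.array(s)                                # one-time conversion to a contiguous utf-8 buffer
mask_arrow = pc.starts_with(arr, pattern="ap")   # vectorized kernel over the Arrow buffer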
@xhochy that would be awesome!
i don’t think in place modification of strings is actually that big of a deal - at worst it’s really a perf issue
nice library by @xhochy https://github.com/xhochy/fletcher
I'm interested in implementing this sometime this year.
Using Arrow as the memory layout seems like the right choice. So I think the main question is how to implement the string algorithms (startswith, extract, etc.).
In fletcher @xhochy is using numba to quickly re-implement these algorithms.
@wesm opened https://issues.apache.org/jira/browse/ARROW-555 for adding these kinds of algorithms to Arrow's C++ library.
@xhochy, @wesm at this point in time, where would you recommend investing development time if I were to find some? Was the choice of numba for fletcher just for getting a prototype together?
The choice for numba in fletcher was for quick prototyping and for me to understand what needs to be done to make pyarrow better accessible without the need to resort to C++ in Python packages. For the final implementation, we should add the string algorithms to the C++ Arrow library to make them also accessible for R, Ruby and friends.
Meanwhile, I think that using numba in fletcher will be a good way for us to implement some of the algorithms in a bit-more-productive-than-prototype manner and we can then gradually move them into core Arrow (which will be a bit more work). But I expect that we also will be limited for now on what we can do with Arrow structures in numba. An important addition to numba which has kept me a bit from working on fletcher is that it now supports dictionaries.
Thanks for the context. For now, I think development can proceed along a few lines simultaneously.
We'll have additional items to discuss like how to make Apache Arrow the default memory for text data (a pandas 2.x discussion probably), and the development / maintenance of fletcher (if this approach works out, I suspect the pandas devs would be interested in maintaining things), but those can wait.
Our general development mantra is optimizing for code reuse. We have a pretty healthy collaboration going with the R and Ruby communities so to implement once and use in 3 different binding layers is pretty powerful
What are the desired algorithms on a string array? I understand Pandas exposes the methods of Python strings, but are those actually useful for columnar work, or are other primitives more important?
AFAIK these methods are pretty popular, a pandas user would often write say df[col].str.lower().str.count("e")
As an exercise for our roadmap (https://github.com/pandas-dev/pandas/pull/27478) I wrote a proposal for adding a string extension type to pandas: https://hackmd.io/@TomAugspurger/Hyuaby6fr That addresses the user-facing API. It explicitly doesn't change the memory representation (though it does enable a future Arrow-backed StringArray, since the actual data would be a private implementation detail).
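(This is roughly what shipped as the StringDtype mentioned in the update at the top; a sketch of the user-facing API, requires pandas >= 1.0:)

import pandas as pd

s = pd.Series(["a", "b", None], dtype="string")
print(s.dtype)          # string
print(s.str.upper())    # .str methods keep the string dtype
print(s[2])             # <NA> -- missing values are pd.NA rather than np.nan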
One thing that came to my mind while reading through the roadmap comments:
A vectorised/efficient string type will probably always be immutable in its storage; this is completely different from the current Pandas semantics.
The main difference for strings in comparison to numeric data is that single row entries are not of a fixed size. Thus storing the strings in a contiguous section of memory cannot always guarantee in-place mutability. In the case of the Arrow storage layout, where we store all strings in a contiguous, non-spaced way, you can only replace string values with strings of the exact same size. In the NumPy version, where you have a fixed size for all rows, you are wasting more memory but are able to do in-place replacement with strings of a smaller size, but you still need to reallocate when you want to insert strings that are larger than your current ones.
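For illustration, the Arrow layout in question (one contiguous utf-8 data buffer plus an int32 offsets buffer, so replacing a value with a longer string forces a rebuild):

import numpy as np
import pyarrow as pa

arr = pa.array(["a", "bcd", "ef"])
validity, offsets, data = arr.buffers()          # validity is None here (no nulls)
print(np.frombuffer(offsets, dtype=np.int32))    # [0 1 4 6] -- start/end of each string
print(data.to_pybytes())                         # b'abcdef' -- all values packed back to back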
In fletcher we're working around this at the moment by creative slicing to still provide a semi-efficient mutable API, but this is not a good long-term solution: https://github.com/xhochy/fletcher/blob/fe6b7fd00d9f3224df3fdb44b1bd6189c5dc3517/fletcher/base.py#L250-L328
_But in the end, this will lead to a different API experience to the pandas end-user._
Thanks for that write-up Tom!
@xhochy wouldn't it be possible to provide the same end-user experience of mutability as we have now?
When doing mutations, you would indeed need to create a new buffer, copying the existing strings while inserting the ones you want to mutate. For sure, this will decrease the performance of mutating (and certainly if you mutate one by one in a for loop). But that might be a worthy trade-off for better memory use / more performant algorithms (which I think will benefit more people than efficient mutation).
In such a case, we would need to build a set of tools to do "batch mutations" still relatively efficiently (e.g. a replace-like method, or a "put" with a bunch of values to set).
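A deliberately naive sketch of what such a hypothetical "put" could look like semantically (a real implementation would splice the Arrow buffers instead of round-tripping through Python objects):

import pyarrow as pa

def put(arr, replacements):
    """Rebuild the string array once with all replacements applied."""
    values = arr.to_pylist()                  # materialize once (the naive part)
    for i, new_value in replacements.items():
        values[i] = new_value
    return pa.array(values, type=pa.string())

arr = pa.array(["a", "b", "c", "d"])
arr = put(arr, {1: "a much longer string", 3: "x"})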
@xhochy wouldn't it be possible to provide the same end-user experience of mutability as we have now?
Yes, just with a different performance feel as you described.
I wonder if it makes sense to have a stringarray module for Python, that uses the arrow spec but does not have an arrow dependency. Pandas and vaex could use that, or other projects that work with arrays of strings.
In vaex, almost all of the string operations are implemented in C++ (utf8 support and regex), it would not be a bad idea to split off that library. The code needs a cleanup, but it's pretty well tested, and pretty fast: https://towardsdatascience.com/vaex-a-dataframe-with-super-strings-789b92e8d861
I don't have a ton of resources to put in this, but I think it will not cost me much time. If there is serious interest in this (someone from pandas wants to do the pandas part), I'm happy to put in some hours.
Ideally, I'd like to see a clean c++ header-only library that this library (pystringarray) and arrow could use, possibly built on xtensor (cc @SylvainCorlay @wolfv), but that can be considered an implementation detail (as long as the API and the memory model stay the same).
I think Arrow also plans to have some string processing methods at some point, and would welcome contributions. So that could also be a place to have such functionality live.
But you explicitly mention a library compatible with but not dependent on Arrow? In Vaex, Arrow is already a dependency, or only optional? Do you think of potential use cases / users that would be interested in this, but where an Arrow dependency is a problem? (it's a heavy dependency for sure)
In vaex-core, currently (because we were future compatible due to 32bit limitation) we are not depending on arrow, although the string memory layout is arrow compatible. The vaex-arrow package is required for loading/writing with arrow files/streams, so it's an optional dependency, vaex-core does not need it.
I think now we could have a pyarrow dependency for vaex-core, although we'll inherit all the installation issues that might come with it (not much experience with it), so I'm still not 100% sure (I read there were windows wheel issues).
But the same approach can be used by other libraries, such as a hypothetical pystringarray package, which would follow the arrow spec, and expose its buffers, but not have a direct pyarrow dependency.
Another approach, discussed with @xhochy, is to have a c++ library (c++ could use a header-only string and stringarray library), possibly built on xtensor or compatible with it. This library could be something that arrow could use, and possibly pystringarray could use.
My point is, I think if general algorithms (especially string algos) go into arrow, it will be 'lost' for use outside of arrow, because it's such a big dependency.
Arrow is only a large dependency if you build all the optional components. I'm concerned there's some FUD being spread here about this topic -- I think it is important to develop a collaborative community that is working together on this (with open community governance) and ensure that downstream consumers can reuse the code that they need without being burdened by optional dependencies.
We are taking two measures in Apache Arrow to make it easier for third party projects to take on the project as a dependency:
Reducing build-time dependencies for the C++ core library to zero. There aren't many dependencies anyway but some projects have taken the position that taking on even a single transitive build dependency (for example, the Flatbuffers compiler) is unacceptable. You can follow the work and pitch in at https://issues.apache.org/jira/browse/ARROW-6637. For the time being we will continue to make the pyarrow package more comprehensive -- if more people get involved in the project we can work to modularize the Python package to enable more piecemeal installation.
Providing a "C protocol" ABI for two libraries sharing no code to nonetheless expose Arrow data structures to each other in-process without any serialization (and without having to generate the Arrow binary protocol). You can see the discussion here https://lists.apache.org/thread.html/462143a1062ad34be529c84eccacf46d0c5c92b607dbd34f6c8bbeb3@%3Cdev.arrow.apache.org%3E
There is a bit of a divide between people who are uncomfortable with e.g. having second-order dependencies, and people who are uncomfortable with a large monolithic dependency.
Having a large tree of dependencies between small packages is very well addressed by a package manager. It allows a separation of concerns between components, and the teams developing them, as soon as APIs and extension points are well-defined. This has been the path of Project Jupyter since the Big Split (tm). Monolithic projects make me somewhat more uncomfortable in general. I rarely am interested in everything in a large monolithic project...
The way we have been doing stuff in the xtensor stack is recommending the use of a package manager. We maintain the conda packages, but xtensor packages have been packaged for Fedora, Arch Linux etc.
I assure you that we hear your concerns and we will do everything we can to address them in time but it will not happen overnight. Our top priority is ensuring that our developer/contributor community is as productive as possible. Based on our contribution graph I would say we have done a good job of this.
The area where we have made the most progress on modular installs actually is in our .deb and .yum packages.
https://github.com/apache/arrow/tree/master/dev/tasks/linux-packages/debian
With recent improvements to conda / conda-forge, we can similarly achieve modularization, at least at the C++ package level.
To have modular Python installs will not be easy. We need help from more people to figure out how to address this from a tooling perspective. The current solution is optimized for developer productivity, so we have to make sure that any changes that are made to the packaging process don't make things much more difficult for contributors.
So until this enhancement is implemented (and adopted by most users via upgrading the library), what is the fastest way to check if a series with dtype object only consists of strings?
For example, I have the following series with dtype object and want to detect if there are any non-string values:
import pandas as pd

series = pd.Series(["string" for i in range(1_000)])
series.loc[0] = 1

def series_has_nonstring_values(series):
    # TODO: how to implement this efficiently?
    return False
assert series_has_nonstring_values(series) is True
I hope that this is the right place to address this issue/question?
@8080labs with the current public API, you can use infer_dtype for this:
In [48]: series = pd.Series(["string" for i in range(1_000)])
In [49]: pd.api.types.infer_dtype(series, skipna=True)
Out[49]: 'string'
In [50]: series.loc[0] = 1
In [51]: pd.api.types.infer_dtype(series, skipna=True)
Out[51]: 'mixed-integer'
There is a faster is_string_array, but that is not public; it will be exposed indirectly through the string dtype that will be included in 1.0: https://github.com/pandas-dev/pandas/pull/27949
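Based on that, the helper from the question above could be written with only the public API (a sketch):

import pandas as pd

def series_has_nonstring_values(series):
    # infer_dtype returns 'string' only if every non-missing value is a str
    return pd.api.types.infer_dtype(series, skipna=True) != "string"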
closed via #27949
There is still relevant discussion here on the second part of this enhancement: a native storage (Tom also updated the top comment to reflect this)
After learning more about the goal of Apache Arrow, vaex will happily depend on it in the (near?) future.
I want to ignore the discussion on where the c++ string library code should live (in or outside arrow), so as not to get sidetracked.
I'm happy to spend a bit of my time to see if I can move algorithms and unit tests to Apache Arrow, but it would be good if some pandas/arrow devs could assist me a bit (I believe @xhochy offered me help once, does that offer still stand?).
Vaex's string API is modeled on Pandas (80-90% compatible), so my guess is that Pandas should be able to make use of this move to Arrow, since it could simply forward many of the string method calls directly to Arrow once the algorithms are moved.
In short:
Thanks for the update @maartenbreddels.
Speaking for myself (not pandas-dev) I don't have a strong opinion on where these algorithms should live. I think pandas will find a way to use them regardless. Putting them in Arrow is probably convenient since we're dancing around a hard dependency on pyarrow in a few places.
I may be wrong, but I don't think any of the core pandas maintainers has C++ experience. One of us could likely help with the Python bindings though, if that'd be helpful.
I opened https://github.com/pandas-dev/pandas/issues/35169 for discussing how we can expose an Arrow-backed StringArray to users.