update for 2019-10-07: We have a StringDtype extension dtype. Its memory model is the same as the old implementation, an object-dtype ndarray of strings. The next step is to store & process it natively.
xref #8627
xref #8643, #8350
Since we introduced Categorical in 0.15.0, I think we have found 2 main uses:
1) as a 'real' Categorical/Factor type to represent a limited subset of values that the column can take on
2) as a memory-saving representation for object dtypes.
I could see introducing a dtype='string', where String is a slightly specialized sub-class of Categorical, with 2 differences compared to a 'regular' Categorical:
Categorical will complain if you do this:
In [1]: df = DataFrame({'A' : Series(list('abc'),dtype='category')})
In [2]: df2 = DataFrame({'A' : Series(list('abd'),dtype='category')})
In [3]: pd.concat([df,df2])
ValueError: incompatible levels in categorical block merge
Note that this works if they are Series (and prob should raise as well, side-issue).
But, if these were both 'string' dtypes, then it's a simple matter to combine (efficiently).
Only allow string/unicode values (iow, don't allow numbers / arbitrary objects); this makes the constructor a bit simpler, but more importantly, you now have a 'real' non-object string dtype.
I don't think this would be that complicated to do. The big change here would be to essentially convert any object dtypes that are strings to dtype='string', e.g. on reading/conversion/etc. This might be a perf issue for some things, but I think the memory savings greatly outweigh the cost.
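For a rough sense of those savings (a sketch only; the exact numbers depend on the data and the pandas version), compare an object-dtype column of repeated strings with its dictionary-encoded equivalent:

import pandas as pd

# same low-cardinality string column, stored two ways
s_obj = pd.Series(["apple", "banana", "cherry"] * 100_000)   # object dtype: one PyObject reference per row
s_cat = s_obj.astype("category")                             # int8 codes + 3 categories

print(s_obj.memory_usage(deep=True))   # counts every Python string object
print(s_cat.memory_usage(deep=True))   # typically an order of magnitude smaller for data like this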
We would then have a 'real' looking object dtype (and object would be relegated to actual python object types, so would be used much less).
cc @shoyer
cc @JanSchulz
cc @jorisvandenbossche
cc @mwiebe
thoughts?
I think it would be a very nice improvement to have a real 'string' dtype in pandas.
So no longer having the confusion in pandas of object dtype being actually in most cases a string, and sometimes a 'real' object.
However, I don't know if this should be 'coupled' to categorical. Maybe that is only a technical implementation detail, but for me it should just be a string dtype, a dtype that holds string values, and has in essence nothing to do with categorical.
If I think about a string dtype, I am more thinking about numpy's string types (but those of course also have impracticalities, e.g. the fixed sizes), or the CHAR/VARCHAR in sql.
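For reference, a quick sketch of that fixed-size impracticality with numpy's unicode dtype (every element is padded to the widest value, and longer assignments are silently truncated):

import numpy as np

arr = np.array(["a", "abc"])    # dtype '<U3': 3 characters reserved per element
arr[0] = "abcdef"               # silently truncated on assignment
print(arr)                      # ['abc' 'abc']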
I'm of two minds about this. This could be quite useful, but on the other hand, it would be _way_ better if this could be done upstream in numpy or dynd. Pandas specific array types are not great for compatibility with the broader ecosystem.
I understand there are good reasons it may not be feasible to implement this upstream (#8350), but these solutions do feel very stop-gap. For example, if @teoliphant is right that dynd could be hooked up in the near future to replace numpy in pandas internals, I would be much more excited about exploring that possibility.
As for this specific proposal:
I would call it "string" rather than "interned_string", unless we're sure interning is always a good idea. Also, libraries like dynd _do_ implement a true variable length string type (unlike numpy), and I think it is a good long term goal to align pandas dtypes with dtypes on the ndarray used for storage.
So I have tagged a related issue, about including integer NA support by using libdynd (#8643). This will actually be the first thing I do (as it's new and cool, and I think a slightly more straightforward path to include dynd as an optional dep).
@mwiebe can you maybe explain a bit about the tradeoffs involved with representing strings in 2 ways using libdynd?
cc @teoliphant
I've had in mind an intention to tweak the string representation in dynd slightly, and have written that up now. https://github.com/ContinuumIO/libdynd/issues/158 The vlen string in dynd does work presently, but it has slightly different properties than what I'm writing here.
Properties that this vlen string has are a 16 byte representation, using the small string optimization. This means strings with size <= 15 bytes encoded as utf-8 will fit in that memory. Bigger strings will involve a dynamic memory allocation per string, a little bit like Python's string, but with the utf-8 encoding and knowledge that it is a string instead of having to go through dynamic dispatch like in numpy object arrays of strings.
Representing strings as a dynd categorical is a bit more complicated, and wouldn't be dynamically updatable in the same way. The types in dynd are immutable, so a categorical type, once created, has a fixed memory layout, etc. This allows for optimized storage, e.g. if the total number of categories is <= 256, each element can be stored as one byte in the array, but it does not allow the assignment of a new string that was not already in the array of categories.
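pandas' own Categorical already illustrates this dictionary-encoded layout (a small array of unique categories plus an integer code per element; with few categories the codes fit in one byte each):

import pandas as pd

c = pd.Categorical(["foo", "bar", "foo", "baz"])
print(c.categories)   # Index(['bar', 'baz', 'foo'], dtype='object')
print(c.codes)        # [2 0 2 1], dtype int8 -- one byte per element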
The issue mentioned in the last comment is now at https://github.com/libdynd/libdynd/issues/158
Any opinions on working on this for 0.19? Hopefully I have some time during the summer :)
There are a few comments in #13827, and I think it's OK if it can be done without breaking existing users' code. We may need some breaking changes in 2.0, but the same limitation should be applied to Categorical...
I want to release 0.19.0 shortly (RC in a couple of weeks). So let's slate this for the next major release (which will be 1.0, rather than 0.20.0) I think.
yep, but let me try this weekend. of course it's ok to put it off to 1.0 if there is no time to review:)
@sinhrks hey I think a real-string pandas dtype would be great. would allow us to be much more strict about object dtype.
How much work / additional code complexity would this require? I see this as a "nice to have" rather than something that adds fundamentally new functionality to the library
maybe @sinhrks can comment more here, but I think at the very least this allows for quite some code simplification. We will then know w/o having to constantly infer whether something is all strings or includes actual objects.
I think it could be done w/o changing much top-level API (e.g. adding another pandas dtype), we have most of this machinery already done.
My concern is that it may introduce new user APIs / semantics which may be in the line of fire for future API breakage. If the immediate _user_ benefits (vs. developer benefits) warrant this risk then it may be worth it
I worked a little on this, and currently expect minimal API change, because it behaves like a Categorical which internally handles categories and codes automatically (users don't need to care about its internal repr).
I assume the impl consists of 2 parts, mostly done by re-using / cleaning-up the current code:
1) a String class which wraps .str methods (this should simplify string.py). Maybe replaced by a StringArray(?) or its wrapper in the future.
2) a string dtype (shares most of its internals with Categorical).
I agree that we shouldn't force users/devs to pay an unnecessary migration cost. I expect it can be achieved by minimizing Categorical API breakage (which should also be applied to String).
This is the first instance in a long time of changing the logical dtype under users' feet. The last (I think?) was the creation of DatetimeIndex and adding datetime64[ns] to the set of supported dtypes. I'm aware of pandas users that are still running on a fork of 0.7.x over this, if you can believe it.
So, this proposed change introduces a couple of immediate API breakages:
1) string_arr.dtype == np.object_ is now False
2) string_arr.values is no longer an ndarray (is that right?)
This alone makes this seem not really comparable to Categorical / DatetimeTZ (those were new types, not modifying existing types).
I don't think .values should be affected; it will still return an object array. This is handled with an indirection from the BlockManager (e.g. .external_values() / .internal_values()).
we did this for datetime-tz:
In [3]: s = Series(pd.date_range('20130101',periods=3,tz='US/Eastern'))
In [4]: s
Out[4]:
0 2013-01-01 00:00:00-05:00
1 2013-01-02 00:00:00-05:00
2 2013-01-03 00:00:00-05:00
dtype: datetime64[ns, US/Eastern]
In [5]: s.values
Out[5]:
array(['2013-01-01T05:00:00.000000000', '2013-01-02T05:00:00.000000000',
'2013-01-03T05:00:00.000000000'], dtype='datetime64[ns]')
In [6]: s._values
Out[6]: DatetimeIndex(['2013-01-01 00:00:00-05:00', '2013-01-02 00:00:00-05:00', '2013-01-03 00:00:00-05:00'], dtype='datetime64[ns, US/Eastern]', freq='D')
unfort the numpy-dtype vs string comparison is horribly broken IMHO by numpy: https://github.com/numpy/numpy/issues/5329
but to be honest it's _already_ broken by category/datetimetz, and ONLY for == on Series. Of course we have the accessors and .select_dtypes as the recommended ways.
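For example, the dtype-comparison-free way to pick out columns by type (a quick sketch):

import pandas as pd

df = pd.DataFrame({"a": ["x", "y"], "b": [1, 2], "c": pd.Categorical(["x", "y"])})
print(df.select_dtypes(include=["object"]).columns.tolist())     # ['a']
print(df.select_dtypes(include=["category"]).columns.tolist())   # ['c']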
Given all this evidence, I don't think this will impose on users any more than we already have, and will make some code quite a bit simpler as @sinhrks indicates.
unfort the numpy-dtype vs string comparison is horribly broken IMHO by numpy: numpy/numpy#5329
Sure, but that doesn't make it any less real for our users. Hopefully they use np.asarray to convert to NumPy arrays, but quite possibly they have logic that checks dtypes and expects strings to use dtype=object.
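For illustration, hypothetical downstream code of the kind described above (the branch stops being taken once all-string columns get their own dtype):

import numpy as np
import pandas as pd

s = pd.Series(["a", "b", "c"])     # today: dtype == object
if s.dtype == np.object_:          # a new string dtype would make this False
    values = np.asarray(s)         # user code assuming an object ndarray of str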
Without fixing this upstream, this sort of breakage is best handled by changing things all together in a pandas 2.0 release.
string_arr.dtype == np.object_ is now False
Yes, but worse: np.dtype(object) == string_arr.dtype will now raise TypeError: data type not understood.
@shoyer Yes this is broken upstream, but I don't see it ever being fixed. There has been very little movement in numpy for fundamental things like this. This is in fact one of the reasons that pandas 2.0 needs to happen.
That said, I also do not see any good reason to hold back on changes which bring a better user experience.
Yes this is broken upstream but I don't see it ever fixed. Has been very little movement in numpy for fundamental things like this.
Things get fixed in NumPy when someone who cares (generally a downstream developer) gets involved and makes it happen, taking the time to get buy in from stakeholders on the mailing list.
That's how I was able to fix datetime64
(in NumPy 1.11) from "implicitly UTC" (with automatic conversion to local time zones when printed) to datetime native.
That said, I also do not see any good reason to hold back on changes which bring a better user experience.
Many parts of pd.String would certainly be a better user experience. But just as certainly, the transition would cause pain for some users due to the API breakage. This is not an unambiguous win, especially if we are going to overhaul things again with pandas 2.0.
As another example, even if .values still works by returning a new numpy array, there's no way to avoid breaking assignment to that array.
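To make that concrete, a sketch of today's behaviour with an object-dtype column:

import pandas as pd

s = pd.Series(["a", "b", "c"])   # object dtype: .values hands back the actual backing ndarray
s.values[0] = "z"
print(s[0])                      # 'z' -- the write is visible in the Series

With a non-ndarray-backed string type, .values would have to materialize a fresh object array, so the same write would be silently lost.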
Things get fixed in NumPy when someone who cares (generally a downstream developer) gets involved and makes it happen, taking the time to get buy in from stakeholders on the mailing list.
That's how I was able to fix datetime64 (in NumPy 1.11) from "implicitly UTC" (with automatic conversion to local time zones when printed) to datetime native.
and that's great. But generally I think the downstream devs have way _less_ time as they have to contend with their own packages.
I have seen the frustration first hand, the glacial pace and endless discussion on the numpy mailing list. Surely they are trying to preserve backward compat and that is great. But this stifles things.
Holding back pandas with this standard in this way just makes people turn away out of frustration. Better performance, better compat, and features have been driving pandas for quite a while. Why stop now?
We _already_ provide much compat with numpy, but that does not mean that this should guide pandas direction _solely_ EVEN in the current pandas versions. 2.0 is going to take a while.
pandas forging ahead is a GREAT thing for the community. We go to great lengths to provide compatibility. sure pd.String is not an unambiguous win, but I don't think _any_ changes are nowadays. There _always_ is a compat issue/argument.
As with all things, there are tradeoffs. Let's try to explicitly list out the concrete pros / user benefits (and code examples showing before/after if relevant) and also cons (API breaks, any changes to memory representation, etc.).
The benefits of having a separate dtype from object are several fold:
pd.String (the class), sub-classing (or maybe a super-class) of pd.Categorical, provides quite a number of memory and performance benefits.
Cons:
.values can coerce to an object array for compat. Assignment via .values will not work. Sure, but in the general case with a 2-d Frame, this generally doesn't work now. Certainly there are times it _can_ work. Further, this case _has_ to go away. We have a set of indexers that already do all of this in a very clear way; providing multiple ways of doing an action is not very pythonic. This is a minor usecase and can be easily documented.
I would propose: string[encoding], with the encoding being optional (e.g. string is acceptable as a dtype).
Some would say that we should just wait for pandas 2.0. However a) this can lay the groundwork for the API change (in the dtype), and b) this may not be all that crazy to do; we have all of the machinery already existing.
related #13941 for Period[freq], and Boolean types (much simpler to implement).
Just to chime in from my (limited) experience from helping with pd.Categorical: that needed one release to introduce the new functionality and one additional major release to work out all the corner cases. While lots of corner cases regarding "encode objects/strings with int + lookup" are now guarded with is_categorical_dtype (and so can be looked at and decided if they guard against "encode only" or a special case of "this is different for categoricals"), I still suspect that a second release will be needed to iron out the corner cases. So IMO implementing it in the release which should become a long term release is quite a risk.
Several thoughts:
I'm wary of adding pd.StringArray (a subclass of pd.Categorical) at this late stage. Other extension dtypes have added new semantic functionality whereas this modifies existing functionality. The strongest argument I see is the more efficient / performant internal representation, but this is quite a rabbit hole. For example:
In [1]: import pandas as pd
In [2]: cats = pd.Categorical.from_array(['foo', 'bar', 'baz'])
In [3]: cats
Out[3]:
[foo, bar, baz]
Categories (3, object): [bar, baz, foo]
In [4]: cats[2] = 'qux'
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-4-4e3b1fefa183> in <module>()
----> 1 cats[2] = 'qux'
/home/wesm/miniconda/lib/python3.5/site-packages/pandas/core/categorical.py in __setitem__(self, key, value)
1609 # something to np.nan
1610 if len(to_add) and not isnull(to_add).all():
-> 1611 raise ValueError("Cannot setitem on a Categorical with a new "
1612 "category, set the categories first")
1613
ValueError: Cannot setitem on a Categorical with a new category, set the categories first
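(For reference, the current escape hatch is to declare the category up front; roughly:)

cats = cats.add_categories(['qux'])   # returns a new Categorical with 'qux' registered
cats[2] = 'qux'                       # now allowed, since 'qux' is a known category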
If we do a dictionary-encoded representation in pandas 2.0, then we will have to deal with this as well, and this implementation work will likely be duplicated. We also don't necessarily know what kinds of performance regressions this will introduce into user code that does a lot of string array mutation.
In [7]: df = pd.DataFrame({'periods': pd.period_range('2000-01-01', periods=10)})
In [8]: df.dtypes
Out[8]:
periods object
dtype: object
In [9]: df['periods'][0]
Out[9]: Period('2000-01-01', 'D')
but the number of users who are depending on this current behavior / representation seems limited compared with strings (which are used by effectively every pandas user).
I would rather be conservative here (given that strings along with floating point numbers are probably the two most important types of data right now used by pandas users) and invest our energies designing a more future-proof foundation in the 2.x development branch (and where we will have enough time to eat the dog food and fix any mistakes before users are impacted).
@jreback Thanks for outlining pros/cons. I think it's important to consider that most of the advantages here will be short lived / obsoleted by pandas 2.0. In contrast, the downsides of yet-another data-type migration are quite real (especially if we need to do more fix-ups later). When we make this change, it is quite likely to break many downstream applications and libraries.
What would valid choices for encoding be? Just ascii and utf8, or the full range of valid Python encodings? From a forward thinking perspective, I would suggest dropping encoding and requiring that all strings be unicode/UTF-8 (like Python 3).
So when I originally wrote this I was actually going to separate this into 2 stages: a creation of the string dtype, and separately the change in the underlying repr (to a pd.String that was categorical based).
My goal was multi-fold.
The dtype change is really for code-cleanup internally. This in-and-of-itself prob does not justify the cost of its changes, but it moves us in the direction of pandas 2.0. I actually think this is a very very important point. Just saying pandas 2.0 seems like it is _around-the-corner_. But we all know that it is at the very least 1 year away from a stable back-compat release.
I don't see why pandas 1.x should slow down / stop, EVEN IF we have a further API change. Here's the crucial point. I think any attempt to make a 'BIG' leap (aka py2/py3) is just a complete disaster and should be avoided at all costs. Including, and up to, multiple 'smaller' API breaks.
This gives people time to adjust gradually. The more gradual the better. Since pandas 2.0 will be a user API change in _maybe_ 1 year, having one in 3 months which will do the bulk of the changes anyhow is, IMHO, beneficial, NOT detrimental.
As far as the details, as @shoyer points out, ideally we could spec this out to be 'about' what pandas 2.0 needs. So I would support string[ascii] for compat and string[utf-8], where string == string[utf-8]. Again these would prob just be a 'display' dtype, e.g.
In [1]: Series(list('abc'))
Out[1]:
0 a
1 b
2 c
dtype: string
In [2]: Series([u'a', u'b', u'c'])
Out[2]:
0 a
1 b
2 c
dtype: string[utf8]
Even if the impl is actually a rabbit hole (and to be honest it will STILL need to be addressed, but of course that could be later), this is such a big win for memory usage that I would push for this in 1.x. That is why I am pushing for the dtype change, with an attendant breakage.
In fact, better to do it now, to see how it shakes out in reality. What better test bed than current pandas?
I don't see why pandas 1.x should slow down / stop, EVEN IF we have a further API change
I have a hard time believing that a major refactor of pandas's internals can ever succeed if pandas 1.x does not commit to strict API stability. If you are against creating a production / fully API-stable maintenance branch, perhaps we should return to that discussion on the mailing list.
To me this feels like one of the most sensitive changes that the library has seen in a long time, on par with the datetime64 work from pandas 0.7 to pandas 0.8, and so I'm not confident that we can get it right on the first try. My gut feeling is that it will affect users in many unknown ways in the long tail, and we won't get that feedback until releasing the change in a major release, which will foil the plan of making a API-stable major release.
As far as the details, as @shoyer points out. Ideally we could spec this out to be 'about' what pandas 2.0 needs.
I'm also very concerned about ending up with a bolted-on solution (that we feel some obligation to stick with) before we have a chance to really dig in and see what uniform, self-contained metadata / logical types look like practically-speaking for users (which will likely take some iteration, so why add constraints now?).
I would personally also go the more conservative route for pandas 1.0, for the following reasons:
If we would like to add this functionality now, I would rather go the 'opt-in' route (if this is possible), and not using it by default for string columns. We _could_ rather easily make a version of the Categorical without the strict checks on the categories, and provide this for users _now_ that want a way to have a more performant string type, without the strictness of current Categorical.
But of course, then we don't have the advantage of simpler code paths internally that @jreback listed above.
ok @wesm and @jorisvandenbossche you make some valid points. So will move this to the 2.0 milestone. All that said, if during discussions it looks like pandas 2.0 will be significantly delayed, e.g. more than 1 year out, then we ought to reconsider non-trivial API changes.
We should be really clear what 'freezing' the API actually means.
We should be really clear what 'freezing' the API actually means.
Yes, let's start a separate discussion for this. Start a discussion on the pandas-dev list first? (or an issue is fine for me as well). I will try to formulate some initial thoughts this evening.
Yes let's take that discussion to the mailing list.
@cpcloud and I were talking about #19520.
This would allow this to proceed. IOW having an external library (pyarrow) to manage the memory of an array of strings. We could then defer all of the ops to the array extension.
would save massively on memory and be quite performant.
cc @TomAugspurger @wesm @jorisvandenbossche
I think that's worth exploring (more generally, I think the extension array stuff offers a decent way of trialing pyarrow-backed things in the pandas-1 codebase).
Would this be transparent to users, or would it be a third-party library, and they'd be required to somehow create an array-backed array, which would then be stored in pandas?
I think it'd be interesting to explore -- since we don't yet have a native operator library for Arrow string arrays, it seems of limited usefulness for any analytics, and probably won't make things any faster right now (since copies of the strings as PyBytes/PyUnicode would have to be materialized to do any computations).
I'm going to have a shot at this using ExtensionArrays, pyarrow and numba, mainly to see what the combination of the three makes possible, so only expect a prototype. Looking at some simple operations like startswith, one can already see a bit better performance than with the current object arrays for strings.
One of the limitations of Arrow, and especially the memory layout of its string array container, is that you will not be able to do in-place operations. I guess for a lot of operations this should not be problematic, but it will lead to copies in cases where a user does df.loc[25:30, 'str_col'] = "some string".
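A rough sketch of the comparison involved (this uses the string kernels that later landed in pyarrow's compute module, not what the fletcher prototype used at the time):

import pandas as pd
import pyarrow as pa
import pyarrow.compute as pc   # requires a reasonably recent pyarrow

s = pd.Series(["apple", "banana", "apricot"] * 100_000)

mask_obj = s.str.startswith("ap")                # object path: iterates over Python string objects

arr = pa.array(s)                                # one-time conversion to a contiguous utf-8 buffer
mask_arrow = pc.starts_with(arr, pattern="ap")   # vectorized kernel over the Arrow buffer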
@xhochy that would be awesome!
i don’t think in place modification of strings is actually that big of a deal - at worst it’s really a perf issue
nice library by @xhochy https://github.com/xhochy/fletcher
I'm interested in implementing this sometime this year.
Using Arrow as the memory layout seems like the right choice. So I think the main question is how to implement the string algorithms (startswith, extract, etc.).
In fletcher @xhochy is using numba to quickly re-implement these algorithms.
@wesm opened https://issues.apache.org/jira/browse/ARROW-555 for adding these kinds of algorithms to Arrow's C++ library.
@xhochy, @wesm at this point in time, where would you recommend investing development time if I were to find some? Was the choice of numba for fletcher just for getting a prototype together?
The choice for numba in fletcher was for quick prototyping and for me to understand what needs to be done to make pyarrow better accessible without the need to resort to C++ in Python packages. For the final implementation, we should add the string algorithms to the C++ Arrow library to make them also accessible for R, Ruby and friends.
Meanwhile, I think that using numba in fletcher will be a good way for us to implement some of the algorithms in a bit-more-productive-than-prototype manner and we can then gradually move them into core Arrow (which will be a bit more work). But I expect that we also will be limited for now on what we can do with Arrow structures in numba. An important addition to numba which has kept me a bit from working on fletcher is that it now supports dictionaries.
Thanks for the context. For now, I think development can proceed along a few lines simultaneously.
We'll have additional items to discuss like how to make Apache Arrow the default memory for text data (a pandas 2.x discussion probably), and the development / maintenance of fletcher (if this approach works out, I suspect the pandas devs would be interested in maintaining things), but those can wait.
Our general development mantra is optimizing for code reuse. We have a pretty healthy collaboration going with the R and Ruby communities so to implement once and use in 3 different binding layers is pretty powerful
What are the desired algorithms on a string array? I understand Pandas exposes the methods of Python strings, but are those actually useful for columnar work, or are other primitives more important?
AFAIK these methods are pretty popular, a pandas user would often write say df[col].str.lower().str.count("e")
As an exercise for our roadmap (https://github.com/pandas-dev/pandas/pull/27478) I wrote a proposal for adding a string extension type to pandas: https://hackmd.io/@TomAugspurger/Hyuaby6fr That addresses the user-facing API. It explicitly doesn't change the memory representation (though it does enable a future Arrow-backed StringArray, since the actual data would be a private implementation detail).
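(This is roughly what shipped as the StringDtype mentioned in the update at the top; a sketch of the user-facing API, requires pandas >= 1.0:)

import pandas as pd

s = pd.Series(["a", "b", None], dtype="string")
print(s.dtype)          # string
print(s.str.upper())    # .str methods keep the string dtype
print(s[2])             # <NA> -- missing values are pd.NA rather than np.nan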
One thing that came to my mind while reading through the roadmap comments:
A vectorised/efficient string type will probably always be immutable in its storage; this is completely different from the current Pandas semantics.
The main difference for strings in comparison to numeric data is that single row entries are not of a fixed size. Thus storing the strings in a contiguous section of memory cannot always guarantee in-place mutability. In the case of the Arrow storage layout, where we store all strings in a contiguous, non-spaced way, you can only replace string values with strings of the exact same size. In the NumPy version, where you have a fixed size for all rows, you are wasting more memory but are able to do in-place replacement with strings of a smaller size, but you still need to reallocate when you want to insert strings that are larger than your current ones.
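For illustration, the Arrow layout in question (one contiguous utf-8 data buffer plus an int32 offsets buffer, so replacing a value with a longer string forces a rebuild):

import numpy as np
import pyarrow as pa

arr = pa.array(["a", "bcd", "ef"])
validity, offsets, data = arr.buffers()          # validity is None here (no nulls)
print(np.frombuffer(offsets, dtype=np.int32))    # [0 1 4 6] -- start/end of each string
print(data.to_pybytes())                         # b'abcdef' -- all values packed back to back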
In fletcher we're working around this at the moment by creative slicing to still provide a semi-efficient mutable API, but this is not a good long-term solution: https://github.com/xhochy/fletcher/blob/fe6b7fd00d9f3224df3fdb44b1bd6189c5dc3517/fletcher/base.py#L250-L328
_But in the end, this will lead to a different API experience to the pandas end-user._
Thanks for that write-up Tom!
@xhochy wouldn't it be possible to provide the same end-user experience of mutability as we have now?
When doing mutations, you would indeed need to create a new buffer, copying the existing strings while inserting the ones you want to mutate. For sure, this will decrease the performance of mutating (and certainly if you mutate one by one in a for loop). But that might be a worthy trade-off for better memory use / more performant algorithms (which I think will benefit more people than efficient mutation).
In such a case, we would need to build a set of tools to do "batch mutations" still relatively efficiently (e.g. a replace-like method, or a "put" with a bunch of values to set).
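A deliberately naive sketch of what such a hypothetical "put" could look like semantically (a real implementation would splice the Arrow buffers instead of round-tripping through Python objects):

import pyarrow as pa

def put(arr, replacements):
    """Rebuild the string array once with all replacements applied."""
    values = arr.to_pylist()                  # materialize once (the naive part)
    for i, new_value in replacements.items():
        values[i] = new_value
    return pa.array(values, type=pa.string())

arr = pa.array(["a", "b", "c", "d"])
arr = put(arr, {1: "a much longer string", 3: "x"})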
@xhochy wouldn't it be possible to provide the same end-user experience of mutability as we have now?
Yes, just with a different performance feel as you described.
I wonder if it makes sense to have a stringarray module for Python, that uses the arrow spec but does not have an arrow dependency. Pandas and vaex could use that, or other projects that work with arrays of strings.
In vaex, almost all of the string operations are implemented in C++ (utf8 support and regex), it would not be a bad idea to split off that library. The code needs a cleanup, but it's pretty well tested, and pretty fast: https://towardsdatascience.com/vaex-a-dataframe-with-super-strings-789b92e8d861
I don't have a ton of resources to put in this, but I think it will not cost me much time. If there is serious interest in this (someone from pandas wants to do the pandas part), I'm happy to put in some hours.
Ideally, I'd like to see a clean c++ header-only library that this library (pystringarray) and arrow could use, possibly built on xtensor (cc @SylvainCorlay @wolfv), but that can be considered an implementation detail (as long as the API and the memory model stay the same).
I think Arrow also plans to have some string processing methods at some point, and would welcome contributions. So that could also be a place to have such functionality live.
But you explicitly mention a library compatible with but not dependent on Arrow? In Vaex, Arrow is already a dependency, or only optional? Do you think of potential use cases / users that would be interested in this, but where an Arrow dependency is a problem? (it's a heavy dependency for sure)
In vaex-core, currently (because we were future compatible due to 32bit limitation) we are not depending on arrow, although the string memory layout is arrow compatible. The vaex-arrow package is required for loading/writing with arrow files/streams, so it's an optional dependency, vaex-core does not need it.
I think now we could have a pyarrow dependency for vaex-core, although we'll inherit all the installation issues that might come with it (not much experience with it), so I'm still not 100% sure (I read there were windows wheel issues).
But the same approach can be used by other libraries, such as a hypothetical pystringarray package, which would follow the arrow spec, and expose its buffers, but not have a direct pyarrow dependency.
Another approach, discussed with @xhochy, is to have a c++ library (c++ could use a header-only string and stringarray library), possibly built on xtensor or compatible with it. This library could be something that arrow could use, and possibly pystringarray could use.
My point is, I think if general algorithms (especially string algos) go into arrow, it will be 'lost' for use outside of arrow, because it's such a big dependency.
Arrow is only a large dependency if you build all the optional components. I'm concerned there's some FUD being spread here about this topic -- I think it is important to develop a collaborative community that is working together on this (with open community governance) and ensure that downstream consumers can reuse the code that they need without being burdened by optional dependencies.
We are taking two measures in Apache Arrow to make it easier for third party projects to take on the project as a dependency:
Reducing build-time dependencies for the C++ core library to zero. There aren't many dependencies anyway but some projects have taken the position that taking on even a single transitive build dependency (for example, the Flatbuffers compiler) is unacceptable. You can follow the work and pitch in at https://issues.apache.org/jira/browse/ARROW-6637. For the time being we will continue to make the pyarrow package more comprehensive -- if more people get involved in the project we can work to modularize the Python package to enable more piecemeal installation.
Providing a "C protocol" ABI for two libraries sharing no code to nonetheless expose Arrow data structures to each other in-process without any serialization (and without having to generate the Arrow binary protocol). You can see the discussion here https://lists.apache.org/thread.html/462143a1062ad34be529c84eccacf46d0c5c92b607dbd34f6c8bbeb3@%3Cdev.arrow.apache.org%3E
There is a bit of a divide between people who are uncomfortable with e.g. having second-order dependencies, and people who are uncomfortable with a large monolithic dependency.
Having a large tree of dependencies between small packages is very well addressed by a package manager. It allows a separation of concerns between components, and the teams developing them, as soon as APIs and extension points are well-defined. This has been the path of Project Jupyter since the Big Split (tm). Monolithic projects make me somewhat more uncomfortable in general. I rarely am interested in everything in a large monolithic project...
The way we have been doing stuff in the xtensor stack is recommending the use of a package manager. We maintain the conda packages, but xtensor packages have been packaged for Fedora, Arch Linux etc.
I assure you that we hear your concerns and we will do everything we can to address them in time but it will not happen overnight. Our top priority is ensuring that our developer/contributor community is as productive as possible. Based on our contribution graph I would say we have done a good job of this.
The area where we have made the most progress on modular installs actually is in our .deb and .yum packages.
https://github.com/apache/arrow/tree/master/dev/tasks/linux-packages/debian
With recent improvements to conda / conda-forge, we can similarly achieve modularization, at least at the C++ package level.
To have modular Python installs will not be easy. We need help from more people to figure out how to address this from a tooling perspective. The current solution is optimized for developer productivity, so we have to make sure that any changes that are made to the packaging process don't make things much more difficult for contributors.
So until this enhancement is implemented (and adopted by most users via upgrading the library), what is the fastest way to check if a series with dtype object only consists of strings?
For example, I have the following series with dtype object and want to detect if there are any non-string values:
import pandas as pd

series = pd.Series(["string" for i in range(1_000)])
series.loc[0] = 1

def series_has_nonstring_values(series):
    # TODO: how to implement this efficiently?
    return False
assert series_has_nonstring_values(series) is True
I hope that this is the right place to address this issue/question?
@8080labs with the current public API, you can use infer_dtype for this:
In [48]: series = pd.Series(["string" for i in range(1_000)])
In [49]: pd.api.types.infer_dtype(series, skipna=True)
Out[49]: 'string'
In [50]: series.loc[0] = 1
In [51]: pd.api.types.infer_dtype(series, skipna=True)
Out[51]: 'mixed-integer'
There is a faster is_string_array, but that is not public; it will be exposed indirectly through the string dtype that will be included in 1.0: https://github.com/pandas-dev/pandas/pull/27949
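Based on that, the helper from the question above could be written with only the public API (a sketch):

import pandas as pd

def series_has_nonstring_values(series):
    # infer_dtype returns 'string' only if every non-missing value is a str
    return pd.api.types.infer_dtype(series, skipna=True) != "string"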
closed via #27949
There is still relevant discussion here on the second part of this enhancement: a native storage (Tom also updated the top comment to reflect this)
After learning more about the goal of Apache Arrow, vaex will happily depend on it in the (near?) future.
I want to ignore the discussion on where the c++ string library code should live (in or outside arrow), so as not to get sidetracked.
I'm happy to spend a bit of my time to see if I can move algorithms and unit tests to Apache Arrow, but it would be good if some pandas/arrow devs could assist me a bit (I believe @xhochy offered me help once, does that offer still stand?).
Vaex's string API is modeled on Pandas (80-90% compatible), so my guess is that Pandas should be able to make use of this move to Arrow, since it could simply forward many of the string method calls directly to Arrow once the algorithms are moved.
In short:
Thanks for the update @maartenbreddels.
Speaking for myself (not pandas-dev) I don't have a strong opinion on where these algorithms should live. I think pandas will find a way to use them regardless. Putting them in Arrow is probably convenient since we're dancing around a hard dependency on pyarrow in a few places.
I may be wrong, but I don't think any of the core pandas maintainers has C++ experience. One of us could likely help with the Python bindings though, if that'd be helpful.
I opened https://github.com/pandas-dev/pandas/issues/35169 for discussing how we can expose an Arrow-backed StringArray to users.