With the new dtypes (`IntegerArray`, `StringArray`, etc.), if you want to use them when reading in data, you have to specify the types for all of the columns. It would be nice to have the option to use the new dtypes for all columns as a keyword to `read_csv()`, `read_excel()`, etc.
(ref. discussion in pandas dev meeting on 11/20/19)
@jorisvandenbossche is this likely for 1.0? My initial preference is to add this in the future (as an option in 1.1 say)
It's certainly not release critical, so let's remove from the milestone.
So do we have an idea of how we would like to tackle this? I think the constructors (e.g. `DataFrame(..)`) could also use a similar option.
An option like `use_new_dtypes=True/False` consistently across functions that create dataframes/series? Or a better name, as "new" is not very descriptive. `use_nullable_dtypes` might not fully cover e.g. strings, as those were already nullable before.
I'd like to argue that if you want to get people to use the new dtypes, and especially `pd.NA`, then this becomes pretty important, because IMHO most missing values are introduced when people read in data. So if you don't change the I/O routines, `pd.NA` is unlikely to get used.
As for names, maybe `use_extension_dtypes`??
What I dislike about `use_extension_dtypes` is that it sounds like an extension to pandas, which is not the case here. The hope is that at some point those are the default dtypes. (I know, the whole thing is called Extension.., but still.)
How about `use_distinct_dtypes`?
@jorisvandenbossche I started to look at this, and if we did it for each reader, it might end up being a lot of work because of all of the different reader implementations. Here's another proposal. What if we created a method `DataFrame.as_nullable_types()` that would take a `DataFrame` and convert any column it could to a nullable type? Then if you used any reader that didn't use the new types, you could convert the entire `DataFrame` in one line, so you could have `df = pd.read_csv('filename.csv').as_nullable_types()` or `df = pd.read_excel('filename.excel').as_nullable_types()`, etc. The rules could look something like this:
My goal here is to make it easy for people to use the new `StringDtype`, `Int64Dtype`, and `BooleanDtype`. If we don't do something like this, I don't think those types will get exercised, because missing values are typically encountered when reading data, and it is painful to have to specify the dtype for each column when reading data with lots of columns.
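A rough sketch of what such a helper could look like (hypothetical: `as_nullable_types` is not an existing pandas method, and the exact conversion rules are still up for discussion):

```python
import pandas as pd
from pandas.api import types

def as_nullable_types(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: convert each column to a nullable dtype where possible."""
    converted = {}
    for name, col in df.items():
        if types.is_integer_dtype(col.dtype):
            converted[name] = col.astype("Int64")      # nullable integer
        elif types.is_bool_dtype(col.dtype):
            converted[name] = col.astype("boolean")    # nullable boolean
        elif types.is_object_dtype(col.dtype) and types.infer_dtype(col, skipna=True) == "string":
            converted[name] = col.astype("string")     # nullable string
        else:
            converted[name] = col                      # leave other dtypes untouched
    return pd.DataFrame(converted)

# usage as proposed above (written as a free function; the proposal would put it on DataFrame)
df = as_nullable_types(pd.read_csv("filename.csv"))
```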
if we did it for each reader, it might end up being a lot of work because of all of the different reader implementations
If we want to have efficient methods, we will probably need to end up with reader-specific implementation anyway, I think. But, that doesn't mean of course that all of the readers need to support it natively, we can start with the important ones and have others convert it after reading. For example, I was planning to work on a parquet reader that directly gives you nullable integers (which avoids a copy and an unneeded roundtrip to float). All to say: I think it is still useful to have it as a reader option as well in some cases.
But that said, a helper method/function to convert an existing dataframe into a new dataframe using nullable types sounds like a good idea, that will be useful anyway.
Conceptually, it is somewhat similar to `DataFrame.infer_objects` ("Attempt to infer better dtypes for object columns."), except here we want to infer better dtypes for all columns.
@jorisvandenbossche I'll work on the helper method, as I think it should be in 1.0, and then later we can figure out how to change the various readers (and in which order) for a later version.
I put up a PR implementing a `use_nullable_dtypes` option for `read_parquet` specifically: https://github.com/pandas-dev/pandas/pull/31242
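For illustration, usage with the keyword proposed in that PR would look roughly like this:

```python
import pandas as pd

# opt in to nullable dtypes when reading parquet (keyword as proposed in the PR)
df = pd.read_parquet("data.parquet", use_nullable_dtypes=True)
df.dtypes  # e.g. integer columns -> Int64, strings -> string, booleans -> boolean
```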
In the PR adding an option to `read_parquet`, we are having a discussion that is getting more general, about such an option in IO and about what to expect from such an option (not only the name), so moving it here.
@jreback and @WillAyd mentioned they would rather prefer `use_extension_dtypes` over `use_nullable_dtypes`, to which I replied:
For me, the main reason not to use `use_extension_dtypes` is: 1) this option does not trigger returning extension dtypes in general. For example, it does not trigger returning categorical or datetimetz (as those are already returned by default by pyarrow), and it does not trigger returning period or interval (those can be returned based on metadata saved in the parquet file / pyarrow extension types); in both cases, extension types will be returned even with `use_extension_dtypes=False`. In contrast, I find `use_nullable_dtypes` clearer in communicating the intent*.
In addition, and more semantically, "extension" types can give the idea of being about "external" extension types (but this is a problem in general with the term, so not that relevant here).
*I think we are going to need some terminology to denote "the dtypes that use `pd.NA` as missing value indicator". Also for our communication (and when discussing it), for the docs, etc., it would be good to have a term that we can consistently use. I think "nullable dtypes" is an option for this (we have already used "nullable integer dtype" for a while in the docs), although it is certainly not ideal, since strictly speaking other dtypes are also "nullable" (floats, object, datetime), just in a different way.
Maybe having this more general discussion can help us find matching keyword names afterwards.
reply of @WillAyd
Sure, as some quick counter arguments:
The third point would probably be the one I think is most of an issue
_reply of @WillAyd_
- If we added more extension types in the future that aren't just focused on NA handling then we have to add another keyword on top of this to parse, which just makes things more complicated
The third point would probably be the one I think is most of an issue
I agree this is a compelling argument not to use `as_nullable_dtypes`. Here are some ideas:
- `as_NA_dtypes` (ones supporting `pd.NA`)
- `as_modern_dtypes` (since anything we'd want to support would be the most modern ones)
- `as_endorsed_dtypes` (then we can determine which are endorsed/recommended ones)

Hoping that stimulates the discussion!
Thanks @WillAyd for those arguments. From that, it seems we still need to discuss / clarify what exact purpose we want to achieve with such an option.
The semantics are unclear to an end user; I would think most consider np.float to be nullable which this wouldn't affect
Yes, that's what I mentioned about "nullable" not being ideal. But that's a general issue for speaking about those dtypes. And as mentioned above, I think we need to find some term for that.
If we clearly define what we mean with "nullable dtype" in the docs and use it consistently throughout the docs for that purpose, I think a term like that can work (IMO, it's at least better than no consistent term).
Also, at some point we might want to have a float dtype that uses `pd.NA` as missing value. So then we also need a term to distinguish it from the "classic" float dtype ("nullable float dtype"?)
Some of the arguments for its clarity are specific to parquet, but I think become more ambiguous if we reuse the same keyword for other parsers (which I hope we would)
Parquet is one of the formats that has the most type information, so the distinction between extension types in general and nullable types specifically is indeed most relevant there. When reading e.g. csv you indeed can't get categoricals. But it's not fully limited to parquet. `read_feather` and `read_orc` are also based on pyarrow, so they have the same type support. You can get categoricals from `read_stata` and `read_spss`, and you can get datetimetz from `read_sql` and (maybe?) `read_excel`.
If we added more extension types in the future that aren't just focused on NA handling then we have to add another keyword on top of this to parse, which just makes things more complicated
Yes, if those new extension types don't use pd.NA as missing value indicator, they would purposefully not fall under this keyword. This issue here is really specifically about those dtypes using pd.NA, as those have a different behavior for operations involving missing values.
Now, I would personally argue that we shouldn't add new extension dtypes that don't use `pd.NA`, but that's another discussion. It's also difficult to discuss such a hypothetical case; one concrete example that has come up is something struct/json-like: those probably can't be stored in typical file formats like csv anyway (and also, we could probably use pd.NA as missing value there).
If we added more extension types in the future that aren't just focused on NA handling then we have to add another keyword on top of this to parse, which just makes things more complicated
The third point would probably be the one I think is most of an issue
I agree this is a compelling argument to not use as_nullable_dtypes.
To highlight a single point of my long post above that was answering this aspect: IMO, if such a new dtype is not using pd.NA, it should not fall under this keyword. So if we do that (debatable of course), then whatever name we come up with (like `use_NA_dtypes`) will have the exact same problem.
Slightly different direction, but what about something like `na_float_cast=True` as a default? I think it's clearer on intention and also doesn't force use of an extension dtype unless NA values are actually detected, which could help save memory.
Slightly different direction, but what about something like `na_float_cast=True` as a default? I think it's clearer on intention and also doesn't force use of an extension dtype unless NA values are actually detected, which could help save memory.
That depends on the behaviour of the option. Right now, I think the intent was to use e.g. the nullable integer dtype for all integer columns, not only those integer columns that have missing values and would otherwise be cast to float. (Also, booleans currently get cast to object if there are missing values.)
It's true that not doing it for all columns can save memory, but personally I would prefer doing it for all columns: 1) you get a consistent result depending on the "logical" type of your column, not on the presence of missing values (e.g. if only reading in a part of the file, this can already differ); 2) missing values can also be introduced after reading (e.g. by reindexing, merge, ..), and then having a nullable integer dtype ensures it doesn't get cast to float, even if the original data didn't have NaNs.
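To illustrate the casting this avoids, a small example using a per-column `dtype` to opt in, since the reader keyword does not exist yet:

```python
import io
import pandas as pd

data = "a,b\n1,x\n,y\n3,z\n"

pd.read_csv(io.StringIO(data)).dtypes
# a    float64   <- integer column with a missing value is cast to float
# b     object

pd.read_csv(io.StringIO(data), dtype={"a": "Int64", "b": "string"}).dtypes
# a     Int64    <- stays integer, the missing value becomes <NA>
# b    string
```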
That depends on the behaviour of the option. Right now, I think the intent was to use e.g. the nullable integer dtype for all integer columns, not only those integer columns that have missing values and would otherwise be cast to float. (Also, booleans currently get cast to object if there are missing values.)
Yea I agree - a clarification on that intent definitely drives this.
I don't think it's worth adding the mask unless needed - it can certainly have non-trivial memory impacts. If my simple math is right, for a 10 million row by 10 column block of integer values, adding the mask would require at least 100 MB more in memory.
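For reference, the arithmetic behind that estimate (and what a bitmask would need instead):

```python
rows, cols = 10_000_000, 10
mask_bytes = rows * cols * 1        # boolean mask: one byte per value
print(mask_bytes / 1e6)             # 100.0 -> ~100 MB of extra memory for the masks
print(rows * cols / 8 / 1e6)        # 12.5  -> a bitmask (1 bit per value) would need ~12.5 MB
```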
I don't think it's worth adding the mask unless needed
But rather than limiting the columns that would get converted to masked / nullable dtypes with the option under discussion here, I would rather try to solve this memory concern by improving the implementation. The concrete ideas we have for this: 1) make the mask optional, so it can be None if there are no missing data (this should not be too hard to do, I think); 2) investigate using a bitmask instead of a boolean array mask (this is probably harder, as there is no standard implementation of this in Python, so it will need some custom code).
Note that the nullable dtypes are still experimental anyway (there are quite some operations that don't work yet, there are things that are slower, ..), so I think this option will in an initial phase mainly be for allowing to easily experiment, try it out. And for such a use case, I think it is more useful to convert all possible columns instead of addressing the memory concern by not converting all columns.
From @WillAyd
- If we added more extension types in the future that aren't just focused on NA handling then we have to add another keyword on top of this to parse, which just makes things more complicated
I had another thought on this. Because a method like `Series.shift()` creates entries with NA in it, I think that any new extension type _always_ needs to do something about NA, and IMHO we would want them to use `pd.NA` and not `np.nan` to represent a "missing value".
Which may mean that a keyword such as `use_missing_value_dtype` might make sense, although it's a lot to type.
And if we really want to stress that this is all about missing values, using `pd.MV` instead of `pd.NA` might help get that point across, but that's probably opening up a whole other can of worms.
@WillAyd regarding the memory issues of masked arrays: there is https://github.com/pandas-dev/pandas/issues/30435 about making the mask optional and https://github.com/pandas-dev/pandas/issues/31293 about exploring bitarrays for the mask.
a keyword such as `use_missing_value_dtype` might make sense
Since np.nan in the float dtype is also used as a "missing value", I am not sure this is less ambiguous than `use_nullable_dtype` (given the argument against `use_nullable_dtype` that there are other dtypes that are also "nullable" without using pd.NA).
`as_NA_dtypes` (ones supporting pd.NA)
This is quite explicit!
For me, a drawback of this one is that I personally find that it sounds less good when using it as the general terminology to speak about this (like "the NA dtypes" in prose text).
Another option is `convert_dtypes=True/False`. I don't think it is very clear from the name what it would do, but that is what we ended up with for the method name in https://github.com/pandas-dev/pandas/pull/30929
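For reference, roughly what that method does once a DataFrame has been read in:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, None], "b": ["x", None, "z"], "c": [True, False, None]})
df.dtypes                    # a: float64, b: object, c: object
df.convert_dtypes().dtypes   # a: Int64,   b: string, c: boolean
```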
a keyword such as `use_missing_value_dtype` might make sense

Since np.nan in the float dtype is also used as a "missing value", I am not sure this is less ambiguous than `use_nullable_dtype` (given the argument against `use_nullable_dtype` that there are other dtypes that are also "nullable" without using pd.NA).

`as_NA_dtypes` (ones supporting pd.NA)

This is quite explicit!
For me, a drawback of this one is that I personally find that it sounds less good when using it as the general terminology to speak about this (like "the NA dtypes" in prose text).
We could combine the two ideas, i.e. `as_missing_value_NA_dtypes`, and you say "the missing value NA dtypes" in prose text to mean the dtypes that represent missing values using `pd.NA`.
Another option is `convert_dtypes=True/False`. I don't think it is very clear from the name what it would do, but that is what we ended up with for the method name in #30929
As the author of the above, I would vote _against_ that in the I/O context, because `convert_dtypes` also has the `infer_objects` behavior, so we would then have different meanings of `convert_dtypes` in two different contexts.
and you say "the missing value NA dtypes" in prose texts to mean the dtypes that represent missing values using pd.NA
For me that is fine if that's a compromise that most people can live with. In that case we should update the doc sections on "Nullable integer data type" to something like "Integer data type with NA missing value" or .. (that's a bit long for a title though)
But personally, I would just propose: let's define "nullable" as "dtype that uses NA" in context of the pandas docs / dtypes in pandas. It's a term we didn't use for anything else up to now (we otherwise don't use it when talking about missing values, all occurrences in the docs of this word are about the new dtypes)
Is there more feedback on my proposal in the comment just above (https://github.com/pandas-dev/pandas/issues/29752#issuecomment-579112054) to use "nullable dtype" in the context of the pandas documentation as meaning "dtype that uses NA as missing value indicator" ?
cc @pandas-dev/pandas-core
makes sense, but it may then cause confusion with isnull, which considers np.NaN, pd.NaT, None etc. to be null.
isnull which considers np.NaN, pd.NaT, None etc to be null.
And we have an `isna` function that also considers all those as missing in addition to NA ... So yes, no single terminology will be ideal given all the historical baggage.
We can update the docstring of `isnull` to clearly indicate that it is an exact alias of `isna`, and does not handle "nullable dtypes" any differently.
Friendly ping here. If I don't hear objections, I will take that as being OK with using the term "nullable dtypes" for "dtype that uses NA as missing value indicator" (in documentation, docstrings, potentially keywords, xref #31242).
nullable_dtypes is not great
would consider: return_dtypes=‘classic’ or ‘modern’
is the point of this keyword to default to ‘modern’ ? and it’s just for compatibility ?
otherwise we would have to deprecate this to change which seems a hassle
is the point of this keyword to default to ‘modern’ ? and it’s just for compatibility ?
otherwise we would have to deprecate this to change which seems a hassle
The initial point is that it makes it easier for people to opt in to try out the new dtypes (similarly to the `convert_dtypes` method, but as an option in the readers, as that can avoid an extra conversion).
So no, initially this keyword will not default to the nullable / modern dtypes (just as we are not planning to make the nullable integer dtype the default int dtype in pandas 1.x).
I am not necessarily opposed to the term "modern", but I personally find it less descriptive / more ambiguous (or subjective) than "nullable".
E.g. is our categorical dtype a "modern" dtype?
I am not necessarily opposed to the term "modern", but I personally find it less descriptive / more ambiguous (or subjective) than "nullable".
I agree with Joris here. classic/modern also means that if we have an even better idea next year we have to use "post-modern"
I am not necessarily opposed to the term "modern", but I personally find it less descriptive / more ambiguous (or subjective) than "nullable".
I agree with Joris here. classic/modern also means that if we have an even better idea next year we have to use "post-modern"
I agree that using "modern" creates issues in the future, even if we don't have a better idea. Because let's say things remain the same 5 years from now. Then "modern" is 5 years old. Sounds odd.
There are two issues at play here: what terminology to use for the dtypes that use `pd.NA`, and what keyword name to use in the readers.
Proposal on the table by @jorisvandenbossche is to solve this via the term "nullable dtypes" for the dtypes that use `pd.NA`, and the keyword `use_nullable_dtypes`.
Here is another possibility: call them the "`pd.NA` dtypes", with the keyword `use_pd_NA_dtypes`.
I'm fine with either of the above and just looking to stimulate discussion.
+1 for defining "nullable" as "dtypes using NA as the missing value indicator".
Since we use "nullable" already in "nullable integer" or "nullable boolean" dtypes, and I suppose we want to keep using that term in that context, I would prefer "nullable" over "dtypes supporting pd.NA
".
Those two are of course cases where it is not ambiguous, since those were not able to store NA/NaNs before, while in the future we might have more ambiguous cases like floats. But if we use "nullable", we can use it consistently for all dtypes that support NA, not only the ones that didn't support NaN before.
Now, of course, even if we decide to use "nullable" as term, we will still add the phrase "dtypes supporting pd.NA
" repeatedly in a lot of places in the docs/docstrings, exactly to establish this relationship.
I still also am not a fan of `nullable_dtypes`. Do we really need to handle StringArray and IntegerArray at the same time via the same keyword here? I wonder if separating those doesn't clear things up a bit
For the former I wonder if we should just do it i.e. a keyword doesn't toggle the behavior. What previously was an object dtype from the IO routines is replaced with the string dtype at a certain point
Seems less invasive than the integer -> float change which maybe does deserve a dedicated keyword
I wonder if we should just do it [about returning string dtype instead of object dtype]
"string" dtype is not backwards compatible with object dtype, so that's the reason that for now (apart from it being experimental), this is not the default. We will indeed want to change this at some point, though. But I think that warrants a dedicated discussion.
I wonder if separating those doesn't clear things up a bit
It's not only IntegerArray vs StringArray. Currently, it is also already BooleanArray (so just a keyword for ints won't cover this). And in the future, I hope that other dtypes will be added, such as a float dtype with NAs (https://github.com/pandas-dev/pandas/issues/32265), ...
So yes, for nullable ints we could add a specific keyword, but for me it is about enabling all new dtypes that use NA, and keep adding new keywords for each of them doesn't seem sustainable?
is a global config option an alternative, to avoid adding keywords entirely?
if a user just wants to use the new types for a single IO read call, they could use the `with pd.option_context(...):` syntax (or `pd.read_excel('filename.excel').as_nullable_types()`).
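As a sketch of that idea (the option name here is purely illustrative; no such option is registered in pandas today):

```python
import pandas as pd

# hypothetical global opt-in; the option name is illustrative only
with pd.option_context("mode.use_nullable_dtypes", True):
    df = pd.read_csv("filename.csv")
```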
potential option naming could be a `use_pandas_2.0_api`, `use_experimental_dtypes`, `use_StringDType` sort of hierarchy, where `use_experimental_dtypes` includes `use_StringDType` and `use_pandas_2.0_api` includes `use_experimental_dtypes`, and we would start to implement the anticipated pandas 2.0 behaviour now. (and these options would be removed in 2.0rc)
without the keyword additions, we could also start adding this to, say, the DataFrame constructor, without changing the api.
is a global config option an alternative, and avoid adding keywords entirely.
A global config is certainly interesting, and something that has been suggested from time to time as a way to opt-in to new dtypes / another way to try it out.
Personally, I think it makes sense to also have it as keywords ("local" option), because a config option works globally (and eg also impacts all the libraries you are using), and depending on your situation, one or the other might work better.
Anyway, also for a global config option we need to agree on a name ;) (which is ideally consistent)
or pd.read_excel('filename.excel').as_nullable_types()
Such an `as_nullable_types()` already exists: it is `convert_dtypes()` (it was actually called `as_nullable_dtypes` first, but renamed as a compromise).
One reason to still have a keyword in addition to this method is to avoid double conversions (e.g. first convert integers to float in the reader, and then infer and convert the floats back to integer in `convert_dtypes`).
we could also start adding this to say the DataFrame constructor,
I think that would indeed also be a good place to add such a keyword.
A potential problem with using "pandas 2.0" in any of the naming is that there are no guarantees right now that this will actually be the default in pandas 2.0 .. (it will depend on when pandas 2.0 happens, how much progress we make on improving the nullable dtypes, etc)
One reason to still have a keyword as well in addition to this method is to avoid double conversions (eg first convert integers to float in the reader, and then infer and convert the floats back to integer in `convert_dtypes`).
Another reason is that getting the "nullable dtypes" to work in different readers might be implemented at different times for different readers. E.g., maybe `read_csv()` gets done first, but getting it to work in `read_excel()` or `read_sql()` happens later. So the keyword argument may not exist for all readers. If I recall correctly, when I tried to figure this out, the use of the nullable dtypes would have to be instrumented separately for each reader, but I could be wrong about that.
Another attempt to try to get to a consensus or compromise here.
So for terminology + a naming scheme for keywords or options, the main proposal is:

- keyword: `use_nullable_dtypes=True/False`
- terminology: "nullable dtypes" are the dtypes that use `pd.NA` as the missing value indicator.

Scrolling through the thread, the following alternatives for keyword names have been mentioned:
- `use_extension_dtypes`: but this would not cover extension dtypes in general, only those that use `pd.NA` as missing value indicator (the nullable dtypes). E.g. categorical, datetimetz, etc. are extension dtypes, but are not the subject of this keyword.
- `use_modern_dtypes=True/False` or `return_dtypes=‘classic’/‘modern’`
- `na_float_cast`
- `use_pandas_2.0_api`
- `use_missing_value_dtype`: but `pd.NA` is not our only "missing value", np.nan (for float64) and pd.NaT are still missing values as well. So IMO this is not less ambiguous than `use_nullable_dtype`.
- `use_experimental_dtypes`
- `use_NA_dtypes` (or `use_pd_NA_dtypes` or `use_missing_value_NA_dtypes`)

I think only the last two bullet points are viable alternatives. And for me, one of those two is fine, if that's a compromise that most people can live with. But I also think that "nullable dtypes" is strictly better: we already use this term in the docs when talking about the nullable integer and boolean dtypes (and we didn't use "nullable" before to denote anything else).
I know the term "nullable" is not perfect (eg, are our current (numpy-based) float columns that use NaN as missing value indicator "nullable" or not?), but we still need some term to refer to the dtypes that use `pd.NA` as missing value indicator, and up to now, I think "nullable" is still the best we have.
@jorisvandenbossche Great summary. I like `use_nullable_dtypes` and `use_NA_dtypes`. (Note: at the beginning of your post above, you used the singular `use_nullable_dtype` rather than the plural `use_nullable_dtypes`, so we also have to figure out whether we want singular or plural as well.)
I'm wondering if we should create a poll and let people vote. And as I said on the call, if you don't like any of the suggestions made so far, you have to propose something else to add to the list, rather than just say "I don't like any of them" :-)
you used the singular with use_nullable_dtype rather than the plural use_nullable_dtypes, so we also have to figure out whether we want singular or plural as well).
Ah, that's a typo (edited now). Since there are multiple nullable dtypes, I think we should just use the plural.
+1 for the definition of "nullable dtypes" and the keyword `use_nullable_dtypes`.
I would prefer `use_na_dtypes` for explicitness.
@WillAyd does that also mean you would prefer to replace all usage of "nullable dtypes" in our documentation to "dtypes that use NA as missing value" or "dtypes supporting pd.NA" or alike ? (eg at https://pandas.pydata.org/docs/user_guide/boolean.html)
Another friendly ping ..
@WillAyd How strong is your preference? Maybe a bit difficult to answer exactly, but meaning: I also still have a preference for `use_nullable_dtypes`, and since a majority of participating voices are OK with that, I would still like to go with `use_nullable_dtypes` if your preference is not too strong.
(and since I am the one pushing for it, it's a bit hard for me to make the final decision ...)
For the keyword itself, I am relatively OK with `use_NA_dtypes` as well, actually. But for running text in the documentation etc., I would rather prefer speaking about "nullable dtypes" (as we already do right now, actually). And if we do that in the docs, I think we should be consistent with the keyword as well.
I continue to think that `use_NA_dtypes` is better, particularly once we add the floatNA types. I won't belabor the point though.
@jreback was also dissenting on this so should see where he stands and go from there
@WillAyd in that case, could you then answer to my question about what you would do with the docs?
Sure I think referring to them as NA dtypes is clearer than Nullable dtypes
have come around here, ok with use_nullable_dtypes
this matches our current doc descriptions.
Do we consider this a blocker for 1.1? If so, anyone want to work on it?
i think we merge the current proposal
Anyone (@Dr-Irv, @jorisvandenbossche) able to work on this? This seems worth doing for 1.1 if it only takes a few days.
@TomAugspurger For me, it won't take a few days, because I think the changes should be made at a pretty low level in the readers, and I'd have to figure out how that code works. The easy solution is to use `convert_dtypes` inside the various readers after the current read operations, but that would be inefficient from a memory standpoint.
I was sent here from issue #35576. (Although https://github.com/pandas-dev/pandas/issues/29752#issuecomment-613077209 is maybe suggesting that this does need to be a separate discussion?)
Personally, the thing I care most about "turning off" in this upcoming edition of "Modern Pandas" is the fallback to the object dtype (because it makes things orders of magnitude slower, and it's pretty easy for it to happen "for you" behind the scenes). Yes, some of that is caused by needing `pd.NA` for a numpy dtype that doesn't support it, but not all cases are from that.
As a very concrete use case, I would want a way for `read_csv()` to always use StringDtype instead of the object dtype. (That would simplify my life a good bit.) But it's not clear to me that the "nullable dtypes" designation applies to that. In my mind, the object dtype is certainly nullable, no? So I wouldn't think of `use_nullable_dtypes=True` as affecting that, personally. Thoughts?
@chrish42 Look at this comment: https://github.com/pandas-dev/pandas/issues/29752#issuecomment-629294120
The definition of a "nullable dtype" is "a pandas dtype supporting pd.NA". That includes `StringDtype` but not `object`, so once this gets implemented, you'd be able to get `StringDtype` as the result of `pd.read_csv()`.
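Until such a keyword exists, a workaround to get `StringDtype` out of `read_csv()` is to request it per column or to convert afterwards (the column names below are just placeholders):

```python
import pandas as pd

# per-column opt-in via the dtype argument (column names are placeholders)
df = pd.read_csv("data.csv", dtype={"name": "string", "city": "string"})

# or convert everything after reading
df = pd.read_csv("data.csv").convert_dtypes()
```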
@Dr-Irv Cool, thank you. That wasn't immediately clear to me. At least, unlike other times, there's no reason why object couldn't support `pd.NA`, right? And for me, the main downside with the `use_nullable_types` name is that even folks who don't need nullable types would benefit from setting it to True. (Well, pretty much everyone would benefit from setting it to True.) But I guess there's no perfect name here, and good documentation will have to do the rest of the job and convey to users what the name isn't conveying.
Anyways, really looking forward to the day when automatic conversions to object (because strings) and to float (because NA) are a thing of the past. So thank you all!
there's no reason why object couldn't support pd.NA, right?
I'd recommend against it. Algorithms need to be written to explicitly handle NA since it's so unusual.
see #32931 for dedicated issue
@chrish42 if one is only interested in getting the new `string` dtype (to avoid object dtype), it's certainly true that the `use_nullable_dtypes=True` keyword is not really obvious (I think that is also one of the reasons for the long discussion above).
But we certainly want a keyword to opt in to all nullable dtypes, so e.g. also nullable int and nullable bool to avoid casting to float etc. And then the name makes more sense. Adding yet another keyword for just getting the `string` dtype is then probably too much.