With the new dtypes (`IntegerArray`, `StringArray`, etc.), if you want to use them when reading in data, you have to specify the types for all of the columns. It would be nice to have the option to use the new dtypes for all columns as a keyword to `read_csv()`, `read_excel()`, etc.
(ref. discussion in pandas dev meeting on 11/20/19)
@jorisvandenbossche is this likely for 1.0? My initial preference is to add this in the future (as an option in 1.1 say)
It's certainly not release critical, so let's remove from the milestone.
So do we have an idea of how we would like to tackle this? I think the constructors (e.g. `DataFrame(..)`) could also use a similar option.
An option like `use_new_dtypes=True/False` consistently across functions that create dataframes/series? Or a better name, as "new" is not very descriptive. `use_nullable_dtypes` might not fully cover e.g. strings, as those were already nullable before.
I'd like to argue that if you want to get people to use the new dtypes, and especially `pd.NA`, then this becomes pretty important, because IMHO most missing values are introduced when people read in data. So if you don't change the I/O routines, `pd.NA` is unlikely to get used.
As for names, maybe `use_extension_dtypes`??
What I dislike about `use_extension_dtypes` is that it sounds like an extension to pandas, which is not the case here. The hope is that at some point those are the default dtypes. (I know, the whole thing is called Extension.., but still.)
How about `use_distinct_dtypes`?
@jorisvandenbossche I started to look at this, and if we did it for each reader, it might end up being a lot of work because of all of the different reader implementations. Here's another proposal. What if we created a method `DataFrame.as_nullable_types()` that would take a `DataFrame` and convert any column it could to a nullable type? Then if you used any reader that didn't use the new types, you could convert the entire `DataFrame` in one line, so you could have `df = pd.read_csv('filename.csv').as_nullable_types()` or `df = pd.read_excel('filename.excel').as_nullable_types()`, etc. The rules could look something like this:
My goal here is to make it easy for people to use the new `StringDtype`, `Int64Dtype`, and `BooleanDtype`. If we don't do something like this, I don't think those types will get exercised, because missing values are typically encountered when reading data, and it is painful to have to specify the dtype for each column when reading data with lots of columns.
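A rough sketch of what such a helper could look like (hypothetical: `as_nullable_types` is not an existing pandas method, and the exact conversion rules are still up for discussion):

```python
import pandas as pd
from pandas.api import types

def as_nullable_types(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: convert each column to a nullable dtype where possible."""
    converted = {}
    for name, col in df.items():
        if types.is_integer_dtype(col.dtype):
            converted[name] = col.astype("Int64")      # nullable integer
        elif types.is_bool_dtype(col.dtype):
            converted[name] = col.astype("boolean")    # nullable boolean
        elif types.is_object_dtype(col.dtype) and types.infer_dtype(col, skipna=True) == "string":
            converted[name] = col.astype("string")     # nullable string
        else:
            converted[name] = col                      # leave other dtypes untouched
    return pd.DataFrame(converted)

# usage as proposed above (written as a free function; the proposal would put it on DataFrame)
df = as_nullable_types(pd.read_csv("filename.csv"))
```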
if we did it for each reader, it might end up being a lot of work because of all of the different reader implementations
If we want to have efficient methods, we will probably need to end up with reader-specific implementation anyway, I think. But, that doesn't mean of course that all of the readers need to support it natively, we can start with the important ones and have others convert it after reading. For example, I was planning to work on a parquet reader that directly gives you nullable integers (which avoids a copy and an unneeded roundtrip to float). All to say: I think it is still useful to have it as a reader option as well in some cases.
But that said, a helper method/function to convert an existing dataframe into a new dataframe using nullable types sounds like a good idea, that will be useful anyway.
Conceptually, it is somewhat similar to `DataFrame.infer_objects` ("Attempt to infer better dtypes for object columns."), except here we want to infer better dtypes for all columns.
@jorisvandenbossche I'll work on the helper method, as I think it should be in 1.0, and then later we can figure out how to change the various readers (and in which order) for a later version.
I put up a PR implementing a `use_nullable_dtypes` option for `read_parquet` specifically: https://github.com/pandas-dev/pandas/pull/31242
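For illustration, usage with the keyword proposed in that PR would look roughly like this:

```python
import pandas as pd

# opt in to nullable dtypes when reading parquet (keyword as proposed in the PR)
df = pd.read_parquet("data.parquet", use_nullable_dtypes=True)
df.dtypes  # e.g. integer columns -> Int64, strings -> string, booleans -> boolean
```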
In the PR adding an option to `read_parquet`, we are having a discussion that is getting more general, about such an option in IO and about what to expect from such an option (not only the name), so moving it here.
@jreback and @WillAyd mentioned they would rather prefer `use_extension_dtypes` over `use_nullable_dtypes`, to which I replied:
For me, the main reason not to use `use_extension_dtypes` is: 1) this option does not trigger returning extension dtypes in general. For example, it does not trigger returning categorical or datetimetz (as those are already returned by default by pyarrow), and it does not trigger returning period or interval (those can be returned based on metadata saved in the parquet file / pyarrow extension types); in both cases, extension types will be returned even with `use_extension_dtypes=False`. In contrast, I find `use_nullable_dtypes` clearer in communicating the intent*.
In addition, and more semantically, "extension" types can give the idea of being about "external" extension types (but this is a problem in general with the term, so not that relevant here).
*I think we are going to need some terminology to denote "the dtypes that use `pd.NA` as missing value indicator". Also for our communication (and when discussing it), for the docs, etc., it would be good to have a term that we can consistently use. I think "nullable dtypes" is an option for this (we have already used "nullable integer dtype" for a while in the docs), although it is certainly not ideal, since strictly speaking other dtypes are also "nullable" (floats, object, datetime), just in a different way.
Maybe having this more general discussion can help us find matching keyword names afterwards.
reply of @WillAyd
Sure, as some quick counter arguments:
The third point would probably be the one I think is most of an issue
_reply of @WillAyd_
- If we added more extension types in the future that aren't just focused on NA handling then we have to add another keyword on top of this to parse, which just makes things more complicated
The third point would probably be the one I think is most of an issue
I agree this is a compelling argument not to use `as_nullable_dtypes`. Here are some ideas:
- `as_NA_dtypes` (ones supporting `pd.NA`)
- `as_modern_dtypes` (since anything we'd want to support would be the most modern ones)
- `as_endorsed_dtypes` (then we can determine which are endorsed/recommended ones)

Hoping that stimulates the discussion!
Thanks @WillAyd for those arguments. From that, it seems we still need to discuss / clarify what exact purpose we want to achieve with such an option.
The semantics are unclear to an end user; I would think most consider np.float to be nullable which this wouldn't affect
Yes, that's what I mentioned about "nullable" not being ideal. But that's a general issue for speaking about those dtypes. And as mentioned above, I think we need to find some term for that.
If we clearly define what we mean with "nullable dtype" in the docs and use it consistently throughout the docs for that purpose, I think a term like that can work (IMO, it's at least better than no consistent term).
Also, at some point we might want to have a float dtype that uses `pd.NA` as missing value. So then we also need a term to distinguish it from the "classic" float dtype ("nullable float dtype"?)
Some of the arguments for its clarity are specific to parquet, but I think become more ambiguous if we reuse the same keyword for other parsers (which I hope we would)
Parquet is one of the formats that has the most type information, so the distinction between extension types in general and nullable types specifically is indeed most relevant there. When reading e.g. csv you indeed can't get categoricals. But it's not fully limited to parquet. `read_feather` and `read_orc` are also based on pyarrow, so they have the same type support. You can get categoricals from `read_stata` and `read_spss`, and you can get datetimetz from `read_sql` and (maybe?) `read_excel`.
If we added more extension types in the future that aren't just focused on NA handling then we have to add another keyword on top of this to parse, which just makes things more complicated
Yes, if those new extension types don't use pd.NA as missing value indicator, they would purposefully not fall under this keyword. This issue here is really specifically about those dtypes using pd.NA, as those have a different behavior for operations involving missing values.
Now, I would personally argue that we shouldn't add new extension dtypes that don't use `pd.NA`, but that's another discussion. It's also difficult to discuss such a hypothetical case; one concrete example that has come up is something struct/json-like: those probably can't be stored in typical file formats like csv anyway (and also, we could probably use pd.NA as missing value there).
If we added more extension types in the future that aren't just focused on NA handling then we have to add another keyword on top of this to parse, which just makes things more complicated
The third point would probably be the one I think is most of an issue
I agree this is a compelling argument to not use as_nullable_dtypes.
To highlight a single point of my long post above that was answering this aspect: IMO, if such a new dtype is not using pd.NA, it should not fall under this keyword. So if we do that (debatable of course), then whatever name we come up with (like `use_NA_dtypes`) will have the exact same problem.
Slightly different direction, but what about something like `na_float_cast=True` as a default? I think it's clearer on intention and also doesn't force use of an extension dtype unless NA values are actually detected, which could help save memory.
Slightly different direction, but what about something like `na_float_cast=True` as a default? I think it's clearer on intention and also doesn't force use of an extension dtype unless NA values are actually detected, which could help save memory.
That depends on the behaviour of the option. Right now, I think the intent was to use e.g. the nullable integer dtype for all integer columns, not only those integer columns that have missing values and would otherwise be cast to float. (Also, booleans currently get cast to object if there are missing values.)
It's true that not doing it for all columns can save memory, but personally I would prefer doing it for all columns: 1) you get a consistent result depending on the "logical" type of your column, not on the presence of missing values (e.g. if only reading in a part of the file, this can already differ); 2) missing values can also be introduced after reading (e.g. by reindexing, merge, ..), and then having a nullable integer dtype ensures it doesn't get cast to float, even if the original data didn't have NaNs.
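To illustrate the casting this avoids, a small example using a per-column `dtype` to opt in, since the reader keyword does not exist yet:

```python
import io
import pandas as pd

data = "a,b\n1,x\n,y\n3,z\n"

pd.read_csv(io.StringIO(data)).dtypes
# a    float64   <- integer column with a missing value is cast to float
# b     object

pd.read_csv(io.StringIO(data), dtype={"a": "Int64", "b": "string"}).dtypes
# a     Int64    <- stays integer, the missing value becomes <NA>
# b    string
```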
That depends on the behaviour of the option. Right now, I think the intent was to use e.g. the nullable integer dtype for all integer columns, not only those integer columns that have missing values and would otherwise be cast to float. (Also, booleans currently get cast to object if there are missing values.)
Yea I agree - a clarification on that intent definitely drives this.
I don't think it's worth adding the mask unless needed - it can certainly have non-trivial memory impacts. If my simple math is right, for a 10 million row by 10 column block of integer values, adding the mask would require at least 100 MB more in memory.
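For reference, the arithmetic behind that estimate (and what a bitmask would need instead):

```python
rows, cols = 10_000_000, 10
mask_bytes = rows * cols * 1        # boolean mask: one byte per value
print(mask_bytes / 1e6)             # 100.0 -> ~100 MB of extra memory for the masks
print(rows * cols / 8 / 1e6)        # 12.5  -> a bitmask (1 bit per value) would need ~12.5 MB
```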
I don't think it's worth adding the mask unless needed
But rather than limiting the columns that would get converted to masked / nullable dtypes with the option under discussion here, I would rather try to solve this memory concern by improving the implementation. The concrete ideas we have for this: 1) make the mask optional, so it can be None if there are no missing data (this should not be too hard to do, I think); 2) investigate using a bitmask instead of a boolean array mask (this is probably harder, as there is no standard implementation of this in Python, so it will need some custom code).
Note that the nullable dtypes are still experimental anyway (there are quite some operations that don't work yet, there are things that are slower, ..), so I think this option will in an initial phase mainly be for allowing to easily experiment, try it out. And for such a use case, I think it is more useful to convert all possible columns instead of addressing the memory concern by not converting all columns.
From @WillAyd
- If we added more extension types in the future that aren't just focused on NA handling then we have to add another keyword on top of this to parse, which just makes things more complicated
I had another thought on this. Because a method like `Series.shift()` creates entries with NA in it, I think that any new extension type _always_ needs to do something about NA, and IMHO we would want them to use `pd.NA` and not `np.nan` to represent a "missing value".
Which may mean that a keyword such as `use_missing_value_dtype` might make sense, although it's a lot to type.
And if we really want to stress that this is all about missing values, using `pd.MV` instead of `pd.NA` might help get that point across, but that's probably opening up a whole other can of worms.
@WillAyd regarding the memory issues of masked arrays: there is https://github.com/pandas-dev/pandas/issues/30435 about making the mask optional and https://github.com/pandas-dev/pandas/issues/31293 about exploring bitarrays for the mask.
a keyword such as `use_missing_value_dtype` might make sense
Since np.nan in the float dtype is also used as a "missing value", I am not sure this is less ambiguous than `use_nullable_dtype` (given the argument against `use_nullable_dtype` that there are other dtypes that are also "nullable" without using pd.NA).
`as_NA_dtypes` (ones supporting pd.NA)
This is quite explicit!
For me, a drawback of this one is that I personally find that it sounds less good when using it as the general terminology to speak about this (like "the NA dtypes" in prose text).
Another option is `convert_dtypes=True/False`. I don't think it is very clear from the name what it would do, but that is what we ended up with for the method name in https://github.com/pandas-dev/pandas/pull/30929
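For reference, roughly what that method does once a DataFrame has been read in:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, None], "b": ["x", None, "z"], "c": [True, False, None]})
df.dtypes                    # a: float64, b: object, c: object
df.convert_dtypes().dtypes   # a: Int64,   b: string, c: boolean
```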
a keyword such as `use_missing_value_dtype` might make sense

Since np.nan in the float dtype is also used as a "missing value", I am not sure this is less ambiguous than `use_nullable_dtype` (given the argument against `use_nullable_dtype` that there are other dtypes that are also "nullable" without using pd.NA).

`as_NA_dtypes` (ones supporting pd.NA)

This is quite explicit!
For me, a drawback of this one is that I personally find that it sounds less good when using it as the general terminology to speak about this (like "the NA dtypes" in prose text).
We could combine the two ideas, i.e. `as_missing_value_NA_dtypes`, and you say "the missing value NA dtypes" in prose text to mean the dtypes that represent missing values using `pd.NA`.
Another option is `convert_dtypes=True/False`. I don't think it is very clear from the name what it would do, but that is what we ended up with for the method name in #30929
As the author of the above, I would vote _against_ that in the I/O context, because `convert_dtypes` also has the `infer_objects` behavior, so we would then have different meanings of `convert_dtypes` in two different contexts.
and you say "the missing value NA dtypes" in prose texts to mean the dtypes that represent missing values using pd.NA
For me that is fine if that's a compromise that most people can live with. In that case we should update the doc sections on "Nullable integer data type" to something like "Integer data type with NA missing value" or .. (that's a bit long for a title though)
But personally, I would just propose: let's define "nullable" as "dtype that uses NA" in context of the pandas docs / dtypes in pandas. It's a term we didn't use for anything else up to now (we otherwise don't use it when talking about missing values, all occurrences in the docs of this word are about the new dtypes)
Is there more feedback on my proposal in the comment just above (https://github.com/pandas-dev/pandas/issues/29752#issuecomment-579112054) to use "nullable dtype" in the context of the pandas documentation as meaning "dtype that uses NA as missing value indicator" ?
cc @pandas-dev/pandas-core
makes sense, but it may then cause confusion with isnull, which considers np.NaN, pd.NaT, None etc. to be null.
isnull which considers np.NaN, pd.NaT, None etc to be null.
And we have an `isna` function that also considers all those as missing in addition to NA ... So yes, no single terminology will be ideal given all the historical baggage.
We can update the docstring of `isnull` to clearly indicate that it is an exact alias of `isna`, and does not handle "nullable dtypes" any differently.
Friendly ping here. If I don't hear objections, I will take that as being OK with using the term "nullable dtypes" for "dtype that uses NA as missing value indicator" (in documentation, docstrings, potentially keywords, xref #31242).
nullable_dtypes is not great
would consider: return_dtypes=‘classic’ or ‘modern’
is the point of this keyword to default to ‘modern’ ? and it’s just for compatibility ?
otherwise we would have to deprecate this to change which seems a hassle
is the point of this keyword to default to ‘modern’ ? and it’s just for compatibility ?
otherwise we would have to deprecate this to change which seems a hassle
The initial point is that it makes it easier for people to opt in to try out the new dtypes (similarly to the `convert_dtypes` method, but as an option in the readers, as that can avoid an extra conversion).
So no, initially this keyword will not default to the nullable / modern dtypes (just as we are not planning to make the nullable integer dtype the default int dtype in pandas 1.x).
I am not necessarily opposed to the term "modern", but I personally find it less descriptive / more ambiguous (or subjective) than "nullable".
E.g. is our categorical dtype a "modern" dtype?
I am not necessarily opposed to the term "modern", but I personally find it less descriptive / more ambiguous (or subjective) than "nullable".
I agree with Joris here. classic/modern also means that if we have an even better idea next year we have to use "post-modern"
I am not necessarily opposed to the term "modern", but I personally find it less descriptive / more ambiguous (or subjective) than "nullable".
I agree with Joris here. classic/modern also means that if we have an even better idea next year we have to use "post-modern"
I agree that using "modern" creates issues in the future, even if we don't have a better idea. Because let's say things remain the same 5 years from now. Then "modern" is 5 years old. Sounds odd.
There are two issues at play here: what terminology to use for the dtypes that use `pd.NA`, and what keyword name to use in the readers.
Proposal on the table by @jorisvandenbossche is to solve this via the term "nullable dtypes" for the dtypes that use `pd.NA`, and the keyword `use_nullable_dtypes`.
Here is another possibility: call them the "`pd.NA` dtypes", with the keyword `use_pd_NA_dtypes`.
I'm fine with either of the above and just looking to stimulate discussion.
+1 for defining "nullable" as "dtypes using NA as the missing value indicator".
Since we use "nullable" already in "nullable integer" or "nullable boolean" dtypes, and I suppose we want to keep using that term in that context, I would prefer "nullable" over "dtypes supporting pd.NA
".
Those two are of course cases where it is not ambiguous, since those were not able to store NA/NaNs before, while in the future we might have more ambiguous cases like floats. But if we use "nullable", we can use it consistently for all dtypes that support NA, not only the ones that didn't support NaN before.
Now, of course, even if we decide to use "nullable" as term, we will still add the phrase "dtypes supporting pd.NA
" repeatedly in a lot of places in the docs/docstrings, exactly to establish this relationship.
I still also am not a fan of `nullable_dtypes`. Do we really need to handle StringArray and IntegerArray at the same time via the same keyword here? I wonder if separating those doesn't clear things up a bit
For the former I wonder if we should just do it i.e. a keyword doesn't toggle the behavior. What previously was an object dtype from the IO routines is replaced with the string dtype at a certain point
Seems less invasive than the integer -> float change which maybe does deserve a dedicated keyword
I wonder if we should just do it [about returning string dtype instead of object dtype]
"string" dtype is not backwards compatible with object dtype, so that's the reason that for now (apart from it being experimental), this is not the default. We will indeed want to change this at some point, though. But I think that warrants a dedicated discussion.
I wonder if separating those doesn't clear things up a bit
It's not only IntegerArray vs StringArray. Currently, it is also already BooleanArray (so just a keyword for ints won't cover this). And in the future, I hope that other dtypes will be added, such as a float dtype with NAs (https://github.com/pandas-dev/pandas/issues/32265), ...
So yes, for nullable ints we could add a specific keyword, but for me it is about enabling all new dtypes that use NA, and keep adding new keywords for each of them doesn't seem sustainable?
is a global config option an alternative, to avoid adding keywords entirely?
if a user just wants to use the new types for a single IO read call, they could use the `with pd.option_context(...):` syntax (or `pd.read_excel('filename.excel').as_nullable_types()`).
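As a sketch of that idea (the option name here is purely illustrative; no such option is registered in pandas today):

```python
import pandas as pd

# hypothetical global opt-in; the option name is illustrative only
with pd.option_context("mode.use_nullable_dtypes", True):
    df = pd.read_csv("filename.csv")
```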
potential option naming could be a `use_pandas_2.0_api`, `use_experimental_dtypes`, `use_StringDType` sort of hierarchy, where `use_experimental_dtypes` includes `use_StringDType` and `use_pandas_2.0_api` includes `use_experimental_dtypes`, and we would start to implement the anticipated pandas 2.0 behaviour now. (and these options would be removed in 2.0rc)
without the keyword additions, we could also start adding this to, say, the DataFrame constructor, without changing the api.
is a global config option an alternative, and avoid adding keywords entirely.
A global config is certainly interesting, and something that has been suggested from time to time as a way to opt-in to new dtypes / another way to try it out.
Personally, I think it makes sense to also have it as keywords ("local" option), because a config option works globally (and eg also impacts all the libraries you are using), and depending on your situation, one or the other might work better.
Anyway, also for a global config option we need to agree on a name ;) (which is ideally consistent)
or pd.read_excel('filename.excel').as_nullable_types()
Such an `as_nullable_types()` already exists: it is `convert_dtypes()` (it was actually called `as_nullable_dtypes` first, but renamed as a compromise).
One reason to still have a keyword in addition to this method is to avoid double conversions (e.g. first convert integers to float in the reader, and then infer and convert the floats back to integer in `convert_dtypes`).
we could also start adding this to say the DataFrame constructor,
I think that would indeed also be a good place to add such a keyword.
A potential problem with using "pandas 2.0" in any of the naming is that there are no guarantees right now that this will actually be the default in pandas 2.0 .. (it will depend on when pandas 2.0 happens, how much progress we make on improving the nullable dtypes, etc)
One reason to still have a keyword as well in addition to this method is to avoid double conversions (eg first convert integers to float in the reader, and then infer and convert the floats back to integer in `convert_dtypes`).
Another reason is that getting the "nullable dtypes" to work in different readers might be implemented at different times for different readers. E.g., maybe `read_csv()` gets done first, but getting it to work in `read_excel()` or `read_sql()` happens later. So the keyword argument may not exist for all readers. If I recall correctly, when I tried to figure this out, the use of the nullable dtypes would have to be instrumented separately for each reader, but I could be wrong about that.
Another attempt to try to get to a consensus or compromise here.
So for terminology + a naming scheme for keywords or options, the main proposal is:

- keyword: `use_nullable_dtypes=True/False`
- terminology: "nullable dtypes" are the dtypes that use `pd.NA` as the missing value indicator.

Scrolling through the thread, the following alternatives for keyword names have been mentioned:
- `use_extension_dtypes`: but this would not cover extension dtypes in general, only those that use `pd.NA` as missing value indicator (the nullable dtypes). E.g. categorical, datetimetz, etc. are extension dtypes, but are not the subject of this keyword.
- `use_modern_dtypes=True/False` or `return_dtypes=‘classic’/‘modern’`
- `na_float_cast`
- `use_pandas_2.0_api`
- `use_missing_value_dtype`: but `pd.NA` is not our only "missing value", np.nan (for float64) and pd.NaT are still missing values as well. So IMO this is not less ambiguous than `use_nullable_dtype`.
- `use_experimental_dtypes`
- `use_NA_dtypes` (or `use_pd_NA_dtypes` or `use_missing_value_NA_dtypes`)

I think only the last two bullet points are viable alternatives. And for me, one of those two is fine, if that's a compromise that most people can live with. But I also think that "nullable dtypes" is strictly better: we already use this term in the docs when talking about the nullable integer and boolean dtypes (and we didn't use "nullable" before to denote anything else).
I know the term "nullable" is not perfect (eg, are our current (numpy-based) float columns that use NaN as missing value indicator "nullable" or not?), but we still need some term to refer to the dtypes that use `pd.NA` as missing value indicator, and up to now, I think "nullable" is still the best we have.
@jorisvandenbossche Great summary. I like `use_nullable_dtypes` and `use_NA_dtypes`. (Note: at the beginning of your post above, you used the singular `use_nullable_dtype` rather than the plural `use_nullable_dtypes`, so we also have to figure out whether we want singular or plural as well.)
I'm wondering if we should create a poll and let people vote. And as I said on the call, if you don't like any of the suggestions made so far, you have to propose something else to add to the list, rather than just say "I don't like any of them" :-)
you used the singular with use_nullable_dtype rather than the plural use_nullable_dtypes, so we also have to figure out whether we want singular or plural as well).
Ah, that's a typo (edited now). Since there are multiple nullable dtypes, I think we should just use the plural.
+1 for the definition of "nullable dtypes" and the keyword `use_nullable_dtypes`.
I would prefer `use_na_dtypes` for explicitness.
@WillAyd does that also mean you would prefer to replace all usage of "nullable dtypes" in our documentation to "dtypes that use NA as missing value" or "dtypes supporting pd.NA" or alike ? (eg at https://pandas.pydata.org/docs/user_guide/boolean.html)
Another friendly ping ..
@WillAyd How strong is your preference? Maybe a bit difficult to answer exactly, but meaning: I also still have a preference for `use_nullable_dtypes`, and since a majority of participating voices are OK with that, I would still like to go with `use_nullable_dtypes` if your preference is not too strong.
(and since I am the one pushing for it, it's a bit hard for me to make the final decision ...)
For the keyword itself, I am relatively OK with `use_NA_dtypes` as well, actually. But for running text in the documentation etc., I would rather prefer speaking about "nullable dtypes" (as we already do right now, actually). And if we do that in the docs, I think we should be consistent with the keyword as well.
I continue to think that `use_NA_dtypes` is better, particularly once we add the floatNA types. I won't belabor the point though.
@jreback was also dissenting on this so should see where he stands and go from there
@WillAyd in that case, could you then answer to my question about what you would do with the docs?
Sure I think referring to them as NA dtypes is clearer than Nullable dtypes
have come around here, ok with use_nullable_dtypes
this matches our current doc descriptions.
Do we consider this a blocker for 1.1? If so, anyone want to work on it?
i think we merge the current proposal
Anyone (@Dr-Irv, @jorisvandenbossche) able to work on this? This seems worth doing for 1.1 if it only takes a few days.
@TomAugspurger For me, it won't take a few days, because I think the changes should be made at a pretty low level in the readers, and I'd have to figure out how that code works. The easy solution is to use `convert_dtypes` inside the various readers after the current read operations, but that would be inefficient from a memory standpoint.
I was sent here from issue #35576. (Although https://github.com/pandas-dev/pandas/issues/29752#issuecomment-613077209 is maybe suggesting that this does need to be a separate discussion?)
Personally, the thing I care most about "turning off" in this upcoming edition of "Modern Pandas" is the fallback to the object dtype (because it makes things orders of magnitude slower, and it's pretty easy for it to happen "for you" behind the scenes). Yes, some of that is caused by needing `pd.NA` for a numpy dtype that doesn't support it, but not all cases are from that.
As a very concrete use case, I would want a way for `read_csv()` to always use StringDtype instead of the object dtype. (That would simplify my life a good bit.) But it's not clear to me that the "nullable dtypes" designation applies to that. In my mind, the object dtype is certainly nullable, no? So I wouldn't think of `use_nullable_dtypes=True` as affecting that, personally. Thoughts?
@chrish42 Look at this comment: https://github.com/pandas-dev/pandas/issues/29752#issuecomment-629294120
The definition of a "nullable dtype" is "a pandas dtype supporting pd.NA". That includes `StringDtype` but not `object`, so once this gets implemented, you'd be able to get `StringDtype` as the result of `pd.read_csv()`.
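Until such a keyword exists, a workaround to get `StringDtype` out of `read_csv()` is to request it per column or to convert afterwards (the column names below are just placeholders):

```python
import pandas as pd

# per-column opt-in via the dtype argument (column names are placeholders)
df = pd.read_csv("data.csv", dtype={"name": "string", "city": "string"})

# or convert everything after reading
df = pd.read_csv("data.csv").convert_dtypes()
```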
@Dr-Irv Cool, thank you. That wasn't immediately clear to me. At least, unlike other times, there's no reason why object couldn't support `pd.NA`, right? And for me, the main downside with the `use_nullable_types` name is that even folks who don't need nullable types would benefit from setting it to True. (Well, pretty much everyone would benefit from setting it to True.) But I guess there's no perfect name here, and good documentation will have to do the rest of the job and convey to users what the name isn't conveying.
Anyways, really looking forward to the day when automatic conversions to object (because strings) and to float (because NA) are a thing of the past. So thank you all!
there's no reason why object couldn't support pd.NA, right?
I'd recommend against it. Algorithms need to be written to explicitly handle NA since it's so unusual.
see #32931 for dedicated issue
@chrish42 if one is only interested in getting the new `string` dtype (to avoid object dtype), it's certainly true that the `use_nullable_dtypes=True` keyword is not really obvious (I think that is also one of the reasons for the long discussion above).
But we certainly want a keyword to opt in to all nullable dtypes, so e.g. also nullable int and nullable bool to avoid casting to float etc. And then the name makes more sense. Adding yet another keyword for just getting the `string` dtype is then probably too much.