Running select_dtypes for a variety of lengths.
import numpy as np
import pandas as pd
from timeit import default_timer as tic

ns = [0, 10, 100, 1_000, 10_000]
times = []
for n in ns:
    df = pd.DataFrame(np.random.randn(10, n))
    t0 = tic()
    df.select_dtypes(include='int')
    t1 = tic()
    times.append([t1 - t0])

df = pd.DataFrame(times, columns=['include'], index=ns)
df.plot()
This looks O(n) in the number of columns. I think that can be improved (to whatever set intersection is)

Edit: maybe it's O(log(n)), I never took CS :)
For up to 10k columns I saw the same behavior as the one you described: 10k columns took me 3 seconds. For 100k columns, however, it took 160 seconds (instead of the roughly 30 seconds that linear scaling would predict).
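As a rough check on those two timings, the sketch below estimates the scaling exponent under the assumption that runtime behaves like c * n**k. The variable names are illustrative, and two data points are far too few for a firm conclusion.

```python
import math

# Timings reported in the comment above: (number of columns, seconds).
small_n, small_t = 10_000, 3.0
large_n, large_t = 100_000, 160.0

# If runtime ~ c * n**k, then k is the slope between two points on a log-log plot.
exponent = math.log(large_t / small_t) / math.log(large_n / small_n)

# What strictly linear scaling would predict for the larger frame.
linear_prediction = small_t * (large_n / small_n)

print(f"estimated exponent: {exponent:.2f}")
print(f"linear prediction for {large_n} columns: {linear_prediction:.0f} s")
```

An exponent clearly above 1 would suggest that something beyond a plain per-column loop is contributing at this width.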
So I profiled the 100k-column call:
Profile
27802797 function calls (27602766 primitive calls) in 155.660 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
100001 89.922 0.001 90.608 0.001 {pandas._libs.lib.infer_dtype}
100000 29.974 0.000 29.974 0.000 {pandas._libs.algos.ensure_object}
100000 15.916 0.000 138.494 0.001 cast.py:93(maybe_downcast_to_dtype)
200005 1.329 0.000 1.329 0.000 {method 'reduce' of 'numpy.ufunc' objects}
3100361 1.309 0.000 1.424 0.000 {built-in method builtins.getattr}
4800567 1.252 0.000 2.520 0.000 {built-in method builtins.isinstance}
100000 0.914 0.000 152.612 0.002 indexing.py:313(_setitem_with_indexer)
700025 0.707 0.000 1.226 0.000 abc.py:180(__instancecheck__)
100004 0.574 0.000 148.195 0.001 managers.py:353(apply)
100000 0.552 0.000 146.671 0.001 blocks.py:841(setitem)
200000 0.548 0.000 2.162 0.000 fromnumeric.py:69(_wrapreduction)
1800238 0.526 0.000 1.268 0.000 generic.py:7(_check)
1400045 0.519 0.000 0.519 0.000 _weakrefset.py:70(__contains__)
100004 0.482 0.000 3.456 0.000 blocks.py:3195(get_block_type)
1 0.413 0.413 155.660 155.660 frame.py:3302(select_dtypes)
100000 0.367 0.000 1.180 0.000 indexing.py:284(_has_valid_positional_setitem_indexer)
500017 0.343 0.000 1.267 0.000 {pandas._libs.lib.is_list_like}
800061 0.318 0.000 0.318 0.000 {built-in method builtins.hasattr}
100008 0.305 0.000 0.435 0.000 managers.py:1470(__init__)
100010 0.292 0.000 0.686 0.000 _dtype.py:319(_name_get)
1400113/1200087 0.292 0.000 0.380 0.000 {built-in method builtins.len}
200012 0.283 0.000 0.283 0.000 generic.py:5181(__setattr__)
100000 0.269 0.000 0.351 0.000 indexing.py:1295(_tuplify)
100000 0.268 0.000 155.078 0.002 indexing.py:199(__setitem__)
100008 0.268 0.000 4.407 0.000 blocks.py:3241(make_block)
300003 0.262 0.000 0.563 0.000 {pandas._libs.lib.is_scalar}
300007 0.252 0.000 0.342 0.000 generic.py:413(_get_axis_name)
400051 0.251 0.000 1.466 0.000 base.py:231(is_dtype)
100008 0.243 0.000 0.456 0.000 blocks.py:120(__init__)
300030 0.242 0.000 0.446 0.000 <frozen importlib._bootstrap>:997(_handle_fromlist)
100011 0.223 0.000 0.427 0.000 common.py:255(is_sparse)
100000 0.212 0.000 2.125 0.000 indexing.py:168(_get_setitem_indexer)
100004 0.206 0.000 0.423 0.000 blocks.py:3273(_extend_blocks)
900144 0.199 0.000 0.199 0.000 {built-in method builtins.issubclass}
300007 0.188 0.000 0.643 0.000 generic.py:426(_get_axis)
100000 0.173 0.000 1.985 0.000 blocks.py:726(_try_coerce_args)
200000 0.172 0.000 0.881 0.000 blocks.py:2738(_can_hold_element)
100000 0.161 0.000 0.516 0.000 generic.py:3324(_maybe_update_cacher)
100000 0.157 0.000 1.454 0.000 indexing.py:1996(_validate_key)
100009 0.152 0.000 0.518 0.000 dtypes.py:1092(is_dtype)
100002 0.139 0.000 0.830 0.000 common.py:99(is_bool_indexer)
100010 0.138 0.000 0.394 0.000 numerictypes.py:365(issubdtype)
200020 0.135 0.000 0.243 0.000 numerictypes.py:293(issubclass_)
100000 0.131 0.000 4.570 0.000 blocks.py:257(make_block)
100000 0.130 0.000 148.324 0.001 managers.py:559(setitem)
100000 0.122 0.000 1.362 0.000 fromnumeric.py:2664(prod)
200000 0.117 0.000 0.117 0.000 fromnumeric.py:70(<dictcomp>)
100000 0.117 0.000 0.440 0.000 indexing.py:2065(_validate_integer)
100011 0.115 0.000 0.116 0.000 generic.py:5162(__getattr__)
100034 0.115 0.000 0.280 0.000 common.py:1743(is_extension_array_dtype)
200000 0.114 0.000 0.680 0.000 cast.py:515(maybe_infer_dtype_type)
100003 0.111 0.000 0.389 0.000 generic.py:5236(_protect_consolidate)
100000 0.109 0.000 1.596 0.000 indexing.py:2162(_convert_to_indexer)
100003 0.101 0.000 0.257 0.000 generic.py:5249(f)
100000 0.098 0.000 0.153 0.000 managers.py:682(is_view)
100000 0.097 0.000 0.097 0.000 frame.py:3430(is_dtype_instance_mapper)
100038 0.097 0.000 0.137 0.000 dtypes.py:83(find)
100000 0.095 0.000 1.017 0.000 fromnumeric.py:2083(any)
100000 0.095 0.000 138.696 0.001 blocks.py:745(_try_coerce_and_cast_result)
100000 0.094 0.000 138.589 0.001 blocks.py:690(_try_cast_result)
200021 0.089 0.000 0.121 0.000 range.py:652(__len__)
100002 0.083 0.000 0.147 0.000 missing.py:128(_isna_new)
100009 0.081 0.000 0.599 0.000 common.py:642(is_interval_dtype)
100007 0.080 0.000 0.433 0.000 dtypes.py:912(is_dtype)
100007 0.078 0.000 0.737 0.000 common.py:357(is_categorical)
100013 0.077 0.000 0.520 0.000 common.py:678(is_categorical_dtype)
100000 0.076 0.000 0.115 0.000 indexers.py:65(check_setitem_lengths)
100021/100019 0.074 0.000 0.074 0.000 {built-in method numpy.array}
100000 0.073 0.000 0.291 0.000 generic.py:3393(_check_is_chained_assignment_possible)
100003 0.071 0.000 0.461 0.000 generic.py:5246(_consolidate_inplace)
100000 0.070 0.000 0.070 0.000 {method 'ravel' of 'numpy.ndarray' objects}
100000 0.067 0.000 0.270 0.000 missing.py:293(notna)
100002 0.067 0.000 0.067 0.000 {built-in method builtins.any}
100008 0.065 0.000 0.078 0.000 blocks.py:243(mgr_locs)
100000 0.065 0.000 0.218 0.000 generic.py:3319(_is_view)
100000 0.064 0.000 0.064 0.000 indexing.py:1296(<listcomp>)
100000 0.064 0.000 0.078 0.000 generic.py:3358(_clear_item_cache)
100000 0.057 0.000 1.237 0.000 indexing.py:2038(_has_valid_setitem_indexer)
100010 0.054 0.000 0.071 0.000 base.py:5707(ensure_index)
100000 0.054 0.000 0.229 0.000 indexers.py:39(is_empty_indexer)
300013 0.052 0.000 0.052 0.000 {method 'get' of 'dict' objects}
100007 0.051 0.000 0.484 0.000 common.py:608(is_period_dtype)
100022 0.050 0.000 0.385 0.000 common.py:539(is_datetime64tz_dtype)
100008 0.050 0.000 0.050 0.000 blocks.py:131(_check_ndim)
200014 0.049 0.000 0.049 0.000 blocks.py:239(mgr_locs)
100003 0.046 0.000 0.060 0.000 managers.py:919(consolidate)
100002 0.044 0.000 0.191 0.000 missing.py:48(isna)
100000 0.042 0.000 0.042 0.000 blocks.py:178(is_view)
100000 0.041 0.000 0.041 0.000 series.py:1030(axes)
100000 0.041 0.000 0.052 0.000 indexing.py:2430(convert_missing_indexer)
100000 0.041 0.000 0.423 0.000 inference.py:246(is_array_like)
100000 0.040 0.000 0.040 0.000 {built-in method pandas._libs.missing.checknull}
100002 0.039 0.000 0.051 0.000 common.py:352(apply_if_callable)
100000 0.038 0.000 0.038 0.000 generic.py:3413(_check_setitem_copy)
100002 0.037 0.000 0.275 0.000 indexers.py:13(is_list_like_indexer)
200003 0.033 0.000 0.033 0.000 {method 'items' of 'dict' objects}
100004 0.030 0.000 0.030 0.000 managers.py:418(<dictcomp>)
200005 0.030 0.000 0.030 0.000 {pandas._libs.lib.is_integer}
200008 0.026 0.000 0.026 0.000 managers.py:1606(_consolidate_inplace)
100002 0.023 0.000 0.023 0.000 {pandas._libs.lib.is_float}
100000 0.022 0.000 0.022 0.000 indexers.py:29(is_scalar_indexer)
100002 0.018 0.000 0.018 0.000 base.py:717(ndim)
100004 0.017 0.000 0.017 0.000 {method 'append' of 'list' objects}
100002 0.015 0.000 0.015 0.000 managers.py:1600(is_consolidated)
100000 0.015 0.000 0.015 0.000 series.py:1057(_is_mixed_type)
100000 0.015 0.000 0.015 0.000 {method 'clear' of 'dict' objects}
100000 0.014 0.000 0.014 0.000 blocks.py:712(_coerce_values)
100000 0.013 0.000 0.013 0.000 blocks.py:741(_try_coerce_result)
100002 0.012 0.000 0.012 0.000 {built-in method builtins.callable}
2 0.001 0.001 0.001 0.001 missing.py:219(_isna_ndarraylike)
1 0.000 0.000 0.000 0.000 {pandas._libs.algos.take_1d_object_object}
7 0.000 0.000 0.000 0.000 {built-in method numpy.empty}
1 0.000 0.000 155.660 155.660 <string>:1(<module>)
8 0.000 0.000 0.001 0.000 series.py:194(__init__)
1 0.000 0.000 0.000 0.000 {built-in method numpy.arange}
4 0.000 0.000 0.000 0.000 {method 'copy' of 'numpy.ndarray' objects}
4 0.000 0.000 0.001 0.000 construction.py:630(sanitize_array)
33 0.000 0.000 0.000 0.000 common.py:1886(_is_dtype_type)
1 0.000 0.000 155.660 155.660 {built-in method builtins.exec}
13 0.000 0.000 0.000 0.000 common.py:2020(pandas_dtype)
3 0.000 0.000 0.001 0.000 algorithms.py:1608(take_nd)
3 0.000 0.000 0.000 0.000 algorithms.py:1481(_get_take_nd_function)
3 0.000 0.000 0.000 0.000 cast.py:986(maybe_cast_to_datetime)
4 0.000 0.000 0.000 0.000 {method 'fill' of 'numpy.ndarray' objects}
17 0.000 0.000 0.000 0.000 {method 'format' of 'str' objects}
9 0.000 0.000 0.000 0.000 generic.py:162(__init__)
15 0.000 0.000 0.000 0.000 base.py:180(construct_from_string)
4 0.000 0.000 0.000 0.000 construction.py:759(_try_cast)
2 0.000 0.000 0.002 0.001 generic.py:6101(fillna)
8 0.000 0.000 0.000 0.000 series.py:416(_set_axis)
17 0.000 0.000 0.000 0.000 series.py:453(name)
1 0.000 0.000 0.002 0.002 __init__.py:1289(wrapper)
1 0.000 0.000 0.001 0.001 generic.py:5603(dtypes)
1 0.000 0.000 0.000 0.000 base.py:277(__new__)
1 0.000 0.000 0.000 0.000 indexers.py:161(maybe_convert_indices)
1 0.000 0.000 0.000 0.000 managers.py:1274(_slice_take_blocks_ax0)
2 0.000 0.000 0.000 0.000 {pandas._libs.algos.take_1d_int64_int64}
2 0.000 0.000 0.000 0.000 cast.py:298(maybe_promote)
8 0.000 0.000 0.000 0.000 common.py:951(is_integer_dtype)
2/1 0.000 0.000 0.000 0.000 common.py:1931(infer_dtype_from_object)
2 0.000 0.000 0.000 0.000 cast.py:880(maybe_infer_to_datetimelike)
1 0.000 0.000 0.001 0.001 generic.py:3524(take)
12 0.000 0.000 0.000 0.000 common.py:225(is_object_dtype)
2 0.000 0.000 0.000 0.000 cast.py:866(maybe_castable)
2 0.000 0.000 0.000 0.000 base.py:569(_simple_new)
1 0.000 0.000 0.001 0.001 managers.py:1376(take)
2 0.000 0.000 0.000 0.000 cast.py:1195(construct_1d_arraylike_from_scalar)
2 0.000 0.000 0.000 0.000 blocks.py:561(_astype)
1 0.000 0.000 0.000 0.000 {built-in method _operator.and_}
5 0.000 0.000 0.000 0.000 dtypes.py:717(construct_from_string)
2 0.000 0.000 0.000 0.000 cast.py:384(infer_dtype_from_scalar)
24 0.000 0.000 0.000 0.000 common.py:211(<lambda>)
1 0.000 0.000 0.001 0.001 managers.py:255(get_dtypes)
1 0.000 0.000 0.000 0.000 base.py:1185(__iter__)
1 0.000 0.000 0.000 0.000 base.py:652(_shallow_copy_with_infer)
3 0.000 0.000 0.000 0.000 range.py:181(_data)
1 0.000 0.000 0.001 0.001 indexing.py:1787(_getitem_axis)
4 0.000 0.000 0.000 0.000 blocks.py:768(copy)
5 0.000 0.000 0.000 0.000 managers.py:167(shape)
5 0.000 0.000 0.000 0.000 managers.py:1585(internal_values)
2 0.000 0.000 0.002 0.001 series.py:4326(fillna)
12 0.000 0.000 0.000 0.000 common.py:1850(_get_dtype)
2 0.000 0.000 0.000 0.000 inference.py:327(is_dict_like)
2 0.000 0.000 0.002 0.001 __init__.py:1287(<lambda>)
1 0.000 0.000 0.000 0.000 base.py:831(array)
5 0.000 0.000 0.000 0.000 generic.py:5145(__finalize__)
2 0.000 0.000 0.000 0.000 generic.py:5742(astype)
1 0.000 0.000 0.001 0.001 indexing.py:803(_getitem_tuple)
9 0.000 0.000 0.000 0.000 managers.py:1558(dtype)
12 0.000 0.000 0.000 0.000 series.py:460(name)
24 0.000 0.000 0.000 0.000 common.py:209(classes)
2 0.000 0.000 0.000 0.000 frame.py:3403(<lambda>)
1 0.000 0.000 0.000 0.000 indexing.py:901(_getitem_lowerdim)
1 0.000 0.000 0.000 0.000 {method 'nonzero' of 'numpy.ndarray' objects}
1 0.000 0.000 0.000 0.000 numpy_.py:39(__init__)
3 0.000 0.000 0.000 0.000 base.py:3901(values)
1 0.000 0.000 0.000 0.000 indexing.py:2377(check_bool_indexer)
1 0.000 0.000 0.000 0.000 numeric.py:2551(array_equal)
9 0.000 0.000 0.000 0.000 common.py:214(classes_and_not_datetimelike)
5 0.000 0.000 0.000 0.000 common.py:508(is_datetime64_dtype)
2 0.000 0.000 0.000 0.000 common.py:1284(is_datetime_or_timedelta_dtype)
3 0.000 0.000 0.000 0.000 common.py:1684(is_extension_type)
1 0.000 0.000 0.000 0.000 common.py:1995(_validate_date_like_dtype)
1 0.000 0.000 0.000 0.000 dtypes.py:1040(construct_from_string)
4 0.000 0.000 0.000 0.000 inference.py:386(is_hashable)
2 0.000 0.000 0.000 0.000 cast.py:574(invalidate_string_dtypes)
2 0.000 0.000 0.000 0.000 _validators.py:337(validate_fillna_kwargs)
1 0.000 0.000 0.000 0.000 base.py:885(take)
1 0.000 0.000 0.000 0.000 base.py:4297(_can_hold_identifiers_and_holds_name)
1 0.000 0.000 0.000 0.000 base.py:4377(equals)
1 0.000 0.000 0.000 0.000 numeric.py:47(__new__)
1 0.000 0.000 0.000 0.000 range.py:410(_shallow_copy)
2 0.000 0.000 0.001 0.001 blocks.py:400(fillna)
1 0.000 0.000 0.000 0.000 managers.py:126(__init__)
15 0.000 0.000 0.000 0.000 managers.py:169(<genexpr>)
8 0.000 0.000 0.000 0.000 managers.py:171(ndim)
1 0.000 0.000 0.000 0.000 managers.py:203(_is_single_block)
1 0.000 0.000 0.000 0.000 managers.py:216(_rebuild_blknos_and_blklocs)
2 0.000 0.000 0.000 0.000 managers.py:580(astype)
1 0.000 0.000 0.000 0.000 managers.py:1224(reindex_indexer)
16 0.000 0.000 0.000 0.000 managers.py:1523(_block)
3 0.000 0.000 0.000 0.000 {method 'astype' of 'numpy.ndarray' objects}
9 0.000 0.000 0.000 0.000 common.py:219(<lambda>)
3 0.000 0.000 0.000 0.000 common.py:1825(_is_dtype)
3 0.000 0.000 0.000 0.000 cast.py:1264(construct_1d_ndarray_preserving_na)
1 0.000 0.000 0.000 0.000 frame.py:397(__init__)
1 0.000 0.000 0.000 0.000 frame.py:3382(_get_info_slice)
1 0.000 0.000 0.000 0.000 generic.py:227(_validate_dtype)
1 0.000 0.000 0.001 0.001 indexing.py:1410(__getitem__)
1 0.000 0.000 0.001 0.001 indexing.py:1435(_getbool_axis)
1 0.000 0.000 0.000 0.000 indexing.py:1727(_is_scalar_access)
2 0.000 0.000 0.000 0.000 blocks.py:558(astype)
1 0.000 0.000 0.000 0.000 managers.py:340(_verify_integrity)
8 0.000 0.000 0.000 0.000 series.py:443(_set_subtyp)
9 0.000 0.000 0.000 0.000 series.py:467(dtype)
5 0.000 0.000 0.000 0.000 series.py:559(_values)
1 0.000 0.000 0.000 0.000 series.py:886(__array__)
3 0.000 0.000 0.000 0.000 {method 'startswith' of 'str' objects}
2 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap>:416(parent)
4 0.000 0.000 0.000 0.000 {method 'any' of 'numpy.ndarray' objects}
13/11 0.000 0.000 0.000 0.000 numeric.py:469(asarray)
2 0.000 0.000 0.000 0.000 {pandas._libs.lib.values_from_object}
4 0.000 0.000 0.000 0.000 common.py:577(is_timedelta64_dtype)
1 0.000 0.000 0.000 0.000 common.py:814(is_datetimelike)
4 0.000 0.000 0.000 0.000 inference.py:353(<genexpr>)
1 0.000 0.000 0.000 0.000 cast.py:1306(maybe_cast_to_integer_array)
4 0.000 0.000 0.000 0.000 base.py:723(__len__)
8 0.000 0.000 0.000 0.000 numeric.py:134(is_all_dates)
1 0.000 0.000 0.000 0.000 range.py:196(_int64index)
1 0.000 0.000 0.000 0.000 indexing.py:229(_has_valid_tuple)
1 0.000 0.000 0.000 0.000 indexing.py:826(_multi_take_opportunity)
3 0.000 0.000 0.000 0.000 indexing.py:1412(<genexpr>)
4 0.000 0.000 0.000 0.000 blocks.py:267(make_block_same_class)
12 0.000 0.000 0.000 0.000 blocks.py:343(dtype)
1 0.000 0.000 0.000 0.000 blocks.py:2771(__init__)
4 0.000 0.000 0.000 0.000 arrays.py:7(extract_array)
1 0.000 0.000 0.000 0.000 managers.py:256(<listcomp>)
2 0.000 0.000 0.001 0.001 managers.py:574(fillna)
3 0.000 0.000 0.000 0.000 managers.py:646(is_consolidated)
1 0.000 0.000 0.000 0.000 managers.py:654(_consolidate_check)
2 0.000 0.000 0.000 0.000 {method 'rpartition' of 'str' objects}
1 0.000 0.000 0.000 0.000 {method 'match' of '_sre.SRE_Pattern' objects}
1 0.000 0.000 0.000 0.000 {method 'all' of 'numpy.ndarray' objects}
1 0.000 0.000 0.000 0.000 {method 'take' of 'numpy.ndarray' objects}
3 0.000 0.000 0.000 0.000 {method 'view' of 'numpy.ndarray' objects}
1 0.000 0.000 0.000 0.000 {built-in method numpy.datetime_data}
8 0.000 0.000 0.000 0.000 {pandas._libs.lib.is_bool}
2 0.000 0.000 0.000 0.000 common.py:711(is_string_dtype)
3 0.000 0.000 0.000 0.000 common.py:862(is_dtype_equal)
1 0.000 0.000 0.000 0.000 common.py:1006(is_signed_integer_dtype)
2 0.000 0.000 0.000 0.000 common.py:1466(needs_i8_conversion)
1 0.000 0.000 0.000 0.000 common.py:1585(is_float_dtype)
2 0.000 0.000 0.000 0.000 common.py:1619(is_bool_dtype)
3 0.000 0.000 0.000 0.000 {pandas._libs.algos.ensure_int64}
1 0.000 0.000 0.000 0.000 dtypes.py:866(construct_from_string)
2 0.000 0.000 0.000 0.000 inference.py:120(is_iterator)
1 0.000 0.000 0.000 0.000 missing.py:393(array_equivalent)
1 0.000 0.000 0.000 0.000 __init__.py:892(_align_method_SERIES)
1 0.000 0.000 0.000 0.000 __init__.py:1252(na_op)
2 0.000 0.000 0.000 0.000 common.py:297(maybe_iterable_to_list)
1 0.000 0.000 0.000 0.000 base.py:613(_get_attributes_dict)
2 0.000 0.000 0.000 0.000 base.py:700(_reset_identity)
2 0.000 0.000 0.000 0.000 base.py:4006(_internal_get_values)
1 0.000 0.000 0.000 0.000 numpy_.py:165(__array__)
1 0.000 0.000 0.000 0.000 sparse.py:223(construct_from_string)
1 0.000 0.000 0.000 0.000 frame.py:491(axes)
3 0.000 0.000 0.000 0.000 generic.py:400(_get_axis_number)
1 0.000 0.000 0.000 0.000 generic.py:430(_get_block_manager_axis)
5 0.000 0.000 0.000 0.000 generic.py:510(ndim)
1 0.000 0.000 0.000 0.000 generic.py:3384(_set_is_copy)
1 0.000 0.000 0.000 0.000 numeric.py:82(_shallow_copy)
1 0.000 0.000 0.000 0.000 range.py:342(dtype)
2 0.000 0.000 0.000 0.000 range.py:467(equals)
3 0.000 0.000 0.000 0.000 indexing.py:243(<genexpr>)
1 0.000 0.000 0.000 0.000 managers.py:132(<listcomp>)
1 0.000 0.000 0.000 0.000 managers.py:325(__len__)
1 0.000 0.000 0.000 0.000 {pandas._libs.internals.get_blkno_placements}
1 0.000 0.000 0.000 0.000 managers.py:2002(_preprocess_slice_or_indexer)
1 0.000 0.000 0.000 0.000 {method 'lower' of 'str' objects}
3 0.000 0.000 0.000 0.000 {built-in method builtins.all}
4 0.000 0.000 0.000 0.000 {built-in method builtins.hash}
1 0.000 0.000 0.000 0.000 {method 'search' of '_sre.SRE_Pattern' objects}
2 0.000 0.000 0.000 0.000 inspect.py:73(isclass)
2 0.000 0.000 0.000 0.000 numeric.py:541(asanyarray)
4 0.000 0.000 0.000 0.000 _methods.py:42(_any)
1 0.000 0.000 0.000 0.000 {pandas._libs.lib.item_from_zerodim}
2 0.000 0.000 0.000 0.000 common.py:1163(is_datetime64_any_dtype)
1 0.000 0.000 0.000 0.000 common.py:1203(is_datetime64_ns_dtype)
1 0.000 0.000 0.000 0.000 common.py:1253(is_timedelta64_ns_dtype)
1 0.000 0.000 0.000 0.000 {pandas._libs.algos.ensure_platform_int}
4 0.000 0.000 0.000 0.000 _validators.py:231(validate_bool_kwarg)
1 0.000 0.000 0.000 0.000 __init__.py:81(get_op_result_name)
1 0.000 0.000 0.000 0.000 base.py:617(<dictcomp>)
1 0.000 0.000 0.000 0.000 base.py:681(is_)
1 0.000 0.000 0.000 0.000 base.py:1826(is_object)
1 0.000 0.000 0.000 0.000 base.py:1829(is_categorical)
1 0.000 0.000 0.000 0.000 base.py:4118(_coerce_to_ndarray)
1 0.000 0.000 0.000 0.000 numpy_.py:122(__init__)
1 0.000 0.000 0.000 0.000 generic.py:185(_init_mgr)
2 0.000 0.000 0.000 0.000 generic.py:486(_info_axis)
1 0.000 0.000 0.000 0.000 range.py:236(start)
1 0.000 0.000 0.000 0.000 range.py:259(stop)
1 0.000 0.000 0.000 0.000 indexing.py:242(_is_nested_tuple_indexer)
2 0.000 0.000 0.000 0.000 indexing.py:1710(_validate_key)
2 0.000 0.000 0.000 0.000 blocks.py:203(external_values)
5 0.000 0.000 0.000 0.000 blocks.py:207(internal_values)
2 0.000 0.000 0.000 0.000 indexing.py:2488(is_label_like)
2 0.000 0.000 0.000 0.000 blocks.py:188(is_categorical_astype)
1 0.000 0.000 0.000 0.000 managers.py:342(<genexpr>)
2 0.000 0.000 0.000 0.000 managers.py:1582(external_values)
5 0.000 0.000 0.000 0.000 series.py:399(_constructor)
2 0.000 0.000 0.000 0.000 series.py:517(values)
1 0.000 0.000 0.000 0.000 {method 'setdefault' of 'dict' objects}
1 0.000 0.000 0.000 0.000 {method 'update' of 'dict' objects}
1 0.000 0.000 0.000 0.000 {method 'isdisjoint' of 'frozenset' objects}
2 0.000 0.000 0.000 0.000 {built-in method __new__ of type object at 0x106c03dd8}
1 0.000 0.000 0.000 0.000 {built-in method builtins.sum}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
1 0.000 0.000 0.000 0.000 _methods.py:45(_all)
2 0.000 0.000 0.000 0.000 common.py:741(condition)
1 0.000 0.000 0.000 0.000 common.py:1281(<lambda>)
1 0.000 0.000 0.000 0.000 function.py:42(__call__)
1 0.000 0.000 0.000 0.000 __init__.py:104(_maybe_match_name)
1 0.000 0.000 0.000 0.000 __init__.py:1304(<lambda>)
2 0.000 0.000 0.000 0.000 common.py:306(is_null_slice)
2 0.000 0.000 0.000 0.000 base.py:747(dtype)
1 0.000 0.000 0.000 0.000 frame.py:380(_constructor)
1 0.000 0.000 0.000 0.000 numeric.py:228(inferred_type)
1 0.000 0.000 0.000 0.000 range.py:282(step)
2 0.000 0.000 0.000 0.000 indexing.py:841(<genexpr>)
1 0.000 0.000 0.000 0.000 indexing.py:1759(_get_partial_string_timestamp_match_key)
2 0.000 0.000 0.000 0.000 managers.py:236(items)
1 0.000 0.000 0.000 0.000 managers.py:655(<listcomp>)
1 0.000 0.000 0.000 0.000 managers.py:761(nblocks)
2 0.000 0.000 0.000 0.000 managers.py:935(_consolidate_inplace)
4 0.000 0.000 0.000 0.000 managers.py:1549(index)
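A listing like the one above could be produced roughly as follows. This is a minimal sketch using the standard-library cProfile and pstats modules; the frame is deliberately much narrower than the 100k-column one so that it finishes quickly, and the exact entries will differ between pandas versions.

```python
import cProfile
import io
import pstats

import numpy as np
import pandas as pd

# A much narrower frame than the 100k-column one profiled above.
df = pd.DataFrame(np.random.randn(10, 1_000))

profiler = cProfile.Profile()
profiler.enable()
df.select_dtypes(include="int")
profiler.disable()

# Sort by internal time ("tottime"), matching the "Ordered by: internal time" header.
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer).sort_stats("tottime")
stats.print_stats(15)  # show only the 15 most expensive entries
report = buffer.getvalue()
print(report)
```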
So I read some portions of the code (starting from the top) and started thinking about this. Firstly, I don't think O(log n) is the right complexity :). We need to determine the dtype of every column, hence the complexity is O(n). Complexity-wise, the optimum we could achieve would be O(1), if we had a dictionary mapping dtypes to columns. This, however, would essentially require something like static typing, or at least keeping track of the type changes of columns after operations. I'm assuming we are not going to do this.
So complexity wise it's still O(n). But we can bring the constants down.
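To illustrate the dictionary idea, here is a small sketch of a dtype-to-columns mapping. The name columns_by_dtype is hypothetical and pandas does not maintain such an index; building it is itself O(n), so the O(1) lookup only pays off if the mapping is kept up to date across operations.

```python
from collections import defaultdict

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": np.arange(3, dtype="int64"),
    "b": np.linspace(0.0, 1.0, 3),
    "c": np.arange(3, dtype="int64"),
})

# Hypothetical index: dtype -> list of column labels with that dtype.
columns_by_dtype = defaultdict(list)
for label, dtype in df.dtypes.items():
    columns_by_dtype[dtype].append(label)

# Once the index exists, a lookup is a single dict access instead of a column scan.
int_columns = columns_by_dtype[np.dtype("int64")]
print(int_columns)
```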
Hence, from the profiling, I'd say that the best chances to improve performance are in pandas._libs.lib.infer_dtype, pandas._libs.algos.ensure_object, and cast.py:93 (maybe_downcast_to_dtype).
However, I don't have any real Cython experience. Could somebody maybe provide some guidelines on how to tackle this issue?
A different approach: I'm assuming that to infer the dtype, a whole array is analyzed. One could maybe add an option approximate=n to select_dtypes which only takes the first n rows to infer the dtype.
Thanks, I think you're right about the complexity stuff. Sorry if I led anyone astray there.
I don't understand your comment about inference though. What exactly are we inferring? We shouldn't be passing the values of a Series / DataFrame to infer_dtype, as we already have the dtypes.
Maybe I don't know pandas internals well enough then. ;) (I was thinking too much in the direction of Python not having static typing, but that doesn't make much sense with e.g. numpy, I have to admit ;))
The profile shows that we are spending most of our time in infer_dtype. Why are we doing that if we know the dtypes? I mean, if we have the dtypes, e.g. in a list, it should take close to no time to get all the dtypes out.
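A small sketch of that point: the dtypes are available from df.dtypes without looking at the column values, so a selection can be expressed as a check on the stored dtypes alone. This illustrates the idea and is not the code path select_dtypes actually takes.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "ints": np.arange(4, dtype="int64"),
    "floats": np.linspace(0.0, 1.0, 4),
    "flags": np.array([True, False, True, False]),
})

# The dtypes are stored metadata; reading them does not scan the column values.
dtypes = df.dtypes

# Compare each stored dtype's scalar type against the requested abstract type.
is_integer = np.array([issubclass(dtype.type, np.integer) for dtype in dtypes])
integer_columns = list(df.columns[is_integer])
print(integer_columns)
```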
I think I'll try to investigate the codepath to see, where and why infer_dtype is called.
Thanks. Glancing at the implementation, we do infer_dtype_from_object on the user-provided include and exclude. That may call infer_dtype.
We may also call it inside the include_these.iloc and exclude_these.iloc calls.
FYI @datajanko if you're looking into this I would recommend line_profiler.
%load_ext line_profiler
%lprun -f pd.DataFrame.select_dtypes df.select_dtypes(include=['int'])
gives
Total time: 2.19406 s
File: /Users/taugspurger/sandbox/pandas/pandas/core/frame.py
Function: select_dtypes at line 3371
Line # Hits Time Per Hit % Time Line Contents
==============================================================
3371 def select_dtypes(self, include=None, exclude=None):
3372 """
3373 Return a subset of the DataFrame's columns based on the column dtypes.
3374
3375 Parameters
3376 ----------
3377 include, exclude : scalar or list-like
3378 A selection of dtypes or strings to be included/excluded. At least
3379 one of these parameters must be supplied.
3380
3381 Returns
3382 -------
3383 DataFrame
3384 The subset of the frame including the dtypes in ``include`` and
3385 excluding the dtypes in ``exclude``.
3386
3387 Raises
3388 ------
3389 ValueError
3390 * If both of ``include`` and ``exclude`` are empty
3391 * If ``include`` and ``exclude`` have overlapping elements
3392 * If any kind of string dtype is passed in.
3393
3394 Notes
3395 -----
3396 * To select all *numeric* types, use ``np.number`` or ``'number'``
3397 * To select strings you must use the ``object`` dtype, but note that
3398 this will return *all* object dtype columns
3399 * See the `numpy dtype hierarchy
3400 <http://docs.scipy.org/doc/numpy/reference/arrays.scalars.html>`__
3401 * To select datetimes, use ``np.datetime64``, ``'datetime'`` or
3402 ``'datetime64'``
3403 * To select timedeltas, use ``np.timedelta64``, ``'timedelta'`` or
3404 ``'timedelta64'``
3405 * To select Pandas categorical dtypes, use ``'category'``
3406 * To select Pandas datetimetz dtypes, use ``'datetimetz'`` (new in
3407 0.20.0) or ``'datetime64[ns, tz]'``
3408
3409 Examples
3410 --------
3411 >>> df = pd.DataFrame({'a': [1, 2] * 3,
3412 ... 'b': [True, False] * 3,
3413 ... 'c': [1.0, 2.0] * 3})
3414 >>> df
3415 a b c
3416 0 1 True 1.0
3417 1 2 False 2.0
3418 2 1 True 1.0
3419 3 2 False 2.0
3420 4 1 True 1.0
3421 5 2 False 2.0
3422
3423 >>> df.select_dtypes(include='bool')
3424 b
3425 0 True
3426 1 False
3427 2 True
3428 3 False
3429 4 True
3430 5 False
3431
3432 >>> df.select_dtypes(include=['float64'])
3433 c
3434 0 1.0
3435 1 2.0
3436 2 1.0
3437 3 2.0
3438 4 1.0
3439 5 2.0
3440
3441 >>> df.select_dtypes(exclude=['int'])
3442 b c
3443 0 True 1.0
3444 1 False 2.0
3445 2 True 1.0
3446 3 False 2.0
3447 4 True 1.0
3448 5 False 2.0
3449 """
3450
3451 1 178.0 178.0 0.0 def _get_info_slice(obj, indexer):
3452 """Slice the info axis of `obj` with `indexer`."""
3453 if not hasattr(obj, "_info_axis_number"):
3454 msg = "object of type {typ!r} has no info axis"
3455 raise TypeError(msg.format(typ=type(obj).__name__))
3456 slices = [slice(None)] * obj.ndim
3457 slices[obj._info_axis_number] = indexer
3458 return tuple(slices)
3459
3460 1 12.0 12.0 0.0 if not is_list_like(include):
3461 include = (include,) if include is not None else ()
3462 1 178.0 178.0 0.0 if not is_list_like(exclude):
3463 1 1.0 1.0 0.0 exclude = (exclude,) if exclude is not None else ()
3464
3465 1 3.0 3.0 0.0 selection = (frozenset(include), frozenset(exclude))
3466
3467 1 2.0 2.0 0.0 if not any(selection):
3468 raise ValueError("at least one of include or exclude must be nonempty")
3469
3470 # convert the myriad valid dtypes object to a single representation
3471 1 161.0 161.0 0.0 include = frozenset(infer_dtype_from_object(x) for x in include)
3472 1 3.0 3.0 0.0 exclude = frozenset(infer_dtype_from_object(x) for x in exclude)
3473 3 5.0 1.7 0.0 for dtypes in (include, exclude):
3474 2 17.0 8.5 0.0 invalidate_string_dtypes(dtypes)
3475
3476 # can't both include AND exclude!
3477 1 3.0 3.0 0.0 if not include.isdisjoint(exclude):
3478 raise ValueError(
3479 "include and exclude overlap on {inc_ex}".format(
3480 inc_ex=(include & exclude)
3481 )
3482 )
3483
3484 # empty include/exclude -> defaults to True
3485 # three cases (we've already raised if both are empty)
3486 # case 1: empty include, nonempty exclude
3487 # we have True, True, ... True for include, same for exclude
3488 # in the loop below we get the excluded
3489 # and when we call '&' below we get only the excluded
3490 # case 2: nonempty include, empty exclude
3491 # same as case 1, but with include
3492 # case 3: both nonempty
3493 # the "union" of the logic of case 1 and case 2:
3494 # we get the included and excluded, and return their logical and
3495 1 602.0 602.0 0.0 include_these = Series(not bool(include), index=self.columns)
3496 1 259.0 259.0 0.0 exclude_these = Series(not bool(exclude), index=self.columns)
3497
3498 1 2.0 2.0 0.0 def is_dtype_instance_mapper(idx, dtype):
3499 return idx, functools.partial(issubclass, dtype.type)
3500
3501 1 3.0 3.0 0.0 for idx, f in itertools.starmap(
3502 10001 32227.0 3.2 1.5 is_dtype_instance_mapper, enumerate(self.dtypes)
3503 ):
3504 10000 11762.0 1.2 0.5 if include: # checks for the case of empty include or exclude
3505 10000 2130846.0 213.1 97.1 include_these.iloc[idx] = any(map(f, include))
3506 10000 15794.0 1.6 0.7 if exclude:
3507 exclude_these.iloc[idx] = not any(map(f, exclude))
3508
3509 1 470.0 470.0 0.0 dtype_indexer = include_these & exclude_these
3510 1 1530.0 1530.0 0.1 return self.loc[_get_info_slice(self, dtype_indexer)]
Thanks for the hint. I'll have a look into this.
From your example, we see that the include_these assignments (and probably the exclude_these ones as well) take the longest. The starmap iteration over each column is inefficient. Actually, we only need to do this once per distinct dtype in self.dtypes, so we would have at most something like 30 hits there. I'll work on the issue ASAP.
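The per-dtype idea could look roughly like the sketch below. The helper select_by_unique_dtype is hypothetical and only handles an include list of types. It runs the subclass check once per distinct dtype and builds the boolean mask in one step, avoiding the repeated .iloc assignments that dominate the line profile.

```python
import numpy as np
import pandas as pd


def select_by_unique_dtype(df, include):
    """Select columns whose dtype is a subclass of any type in `include`.

    Hypothetical helper, not the pandas implementation: the subclass check
    runs once per distinct dtype instead of once per column.
    """
    include = tuple(include)
    # One issubclass check per distinct dtype (usually a handful).
    matches = {
        dtype: issubclass(dtype.type, include)
        for dtype in df.dtypes.unique()
    }
    # Per column only a dict lookup remains; the mask is built in one go
    # rather than through repeated `.iloc` assignments.
    mask = np.array([matches[dtype] for dtype in df.dtypes], dtype=bool)
    return df.loc[:, mask]


wide = pd.DataFrame(np.random.randn(10, 1_000))
wide["label"] = np.arange(10)
result = select_by_unique_dtype(wide, [np.integer])
print(list(result.columns))
```

The loop over columns is still O(n), but each step is a dictionary lookup rather than an indexed assignment into a Series.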
Okay, for small data, this can be easily improved:
0 0.001359
10 0.002055
100 0.012956
1000 0.077123
10000 0.689586
100000 7.288168
changes to
0 0.001990
10 0.001958
100 0.001683
1000 0.008553
10000 0.136975
100000 15.086613
Note that the last line looks awful, and the second line looks nice. Here is what I did:
Starting after Line 3505
def is_dtype_instance_mapper(dtype, ids):
    return functools.partial(issubclass, dtype.type), ids

dtypes_ids = {}
for idx, dtype in enumerate(self.dtypes):
    dtypes_ids[dtype] = dtypes_ids.get(dtype, []) + [idx]

for f, ids in itertools.starmap(
    is_dtype_instance_mapper, dtypes_ids.items()
):
    if include:  # checks for the case of empty include or exclude
        include_these.iloc[ids] = any(map(f, include))
    if exclude:
        exclude_these.iloc[ids] = not any(map(f, exclude))
So obviously, rebuilding the dict values by concatenating a one-item list for every column does not scale well here. A different approach I'll try next is to just groupby and get_dummies on the dtypes (it would be nice if we had a dtype index :)). I'd guess that this is already more optimized.
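If the list rebuilding is indeed the bottleneck, one option is to append in place. The sketch below uses plain Python with a stand-in list of dtype names instead of self.dtypes. Whether this alone accounts for the 100k-column slowdown has not been verified here.

```python
from collections import defaultdict

# Stand-in for `self.dtypes`: many columns sharing a few dtype names.
column_dtypes = ["float64"] * 50_000 + ["int64"] * 50_000

# `d[key] = d.get(key, []) + [idx]` copies the whole list on every step,
# so k columns of one dtype cost on the order of k**2 element copies.
# Appending in place is amortized constant time per column.
dtypes_ids = defaultdict(list)
for idx, dtype in enumerate(column_dtypes):
    dtypes_ids[dtype].append(idx)

print({dtype: len(ids) for dtype, ids in dtypes_ids.items()})
```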
On a slightly different note: I'm not able to install line_profiler in the environment provided by the environment.yml. Should I raise an issue there? Besides, shall I create a WIP pull request?
Hmm I'm not sure. Are you pip or conda installing it? It does have a C extension, not sure if they have a wheel.
A WIP PR is just fine. Make sure to include a new ASV with a wide-ish DataFrame.
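An ASV benchmark for this could be sketched as below. The class name, parameter values, and method names are illustrative assumptions; they follow the usual asv conventions (params, param_names, setup, and time_-prefixed methods) and are not taken from the pandas benchmark suite.

```python
import numpy as np
import pandas as pd


class SelectDtypesWide:
    # Hypothetical benchmark class; name and parameter values are illustrative.
    params = [100, 1_000, 10_000]
    param_names = ["n_columns"]

    def setup(self, n_columns):
        # Few rows, many columns: the wide shape discussed in this thread.
        self.df = pd.DataFrame(np.random.randn(10, n_columns))

    def time_select_dtypes_include(self, n_columns):
        self.df.select_dtypes(include="int")

    def time_select_dtypes_exclude(self, n_columns):
        self.df.select_dtypes(exclude="int")
```

asv calls setup before each timed method, so constructing the frame is not included in the measurement.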
I tried both ways to install it, without success. I'll attach an asv probably tomorrow.