import io
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(1000000, 10), columns=('COL{}'.format(i) for i in range(10)))
csv = io.StringIO(df.to_csv(index=False))
df2 = pd.read_csv(csv)
`pd.read_csv()` using `_libs.parsers.TextReader`'s `read()` method is 3.5X slower on Pandas 0.23.4 on Python 3.7.1 compared to Pandas 0.22.0 on Python 3.5.2.
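For anyone reproducing outside IPython, here is a minimal sketch of how a comparable profile can be gathered with the standard cProfile module (`%prun` in IPython is equivalent; the setup is the same snippet as above):

```python
import cProfile
import io

import numpy as np
import pandas as pd

# Build a 1M-row, 10-column CSV in memory, as in the snippet above.
df = pd.DataFrame(np.random.randn(1000000, 10),
                  columns=('COL{}'.format(i) for i in range(10)))
csv = io.StringIO(df.to_csv(index=False))

# Profile the parse and sort by internal time, like %prun does.
cProfile.run('df2 = pd.read_csv(csv)', sort='tottime')
```

__Python 3.7.1 & Pandas 0.23.4 -- Slow__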
4244 function calls (4210 primitive calls) in 10.273 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
1 10.202 10.202 10.204 10.204 {method 'read' of 'pandas._libs.parsers.TextReader' objects}
1 0.039 0.039 0.039 0.039 internals.py:5017(_stack_arrays)
1 0.011 0.011 10.262 10.262 parsers.py:414(_read)
1 0.011 0.011 10.273 10.273 <string>:1(<module>)
1 0.004 0.004 0.004 0.004 parsers.py:1685(__init__)
321 0.001 0.000 0.002 0.000 common.py:811(is_integer_dtype)
__Python 3.5.2 & Pandas 0.22.0 -- Fast__
3229 function calls (3222 primitive calls) in 2.944 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
1 2.881 2.881 2.882 2.882 {method 'read' of 'pandas._libs.parsers.TextReader' objects}
1 0.045 0.045 0.045 0.045 internals.py:4801(_stack_arrays)
1 0.010 0.010 2.944 2.944 parsers.py:423(_read)
1 0.004 0.004 0.004 0.004 parsers.py:1677(__init__)
320 0.001 0.000 0.001 0.000 common.py:777(is_integer_dtype)
1 0.001 0.001 0.001 0.001 {method 'close' of 'pandas._libs.parsers.TextReader' objects}
pd.show_versions() -- Latest Python 3.7.1 Pandas 0.23.4 : Slow Read CSV
commit: None
python: 3.7.1.final.0
python-bits: 64
OS: Windows
OS-release: 2008ServerR2
machine: AMD64
processor: Intel64 Family 6 Model 63 Stepping 2, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None
pandas: 0.23.4
pytest: 3.9.2
pip: 18.1
setuptools: 40.4.3
Cython: 0.29
numpy: 1.15.3
scipy: 1.1.0
pyarrow: 0.11.0
xarray: 0.10.9
IPython: 7.0.1
sphinx: 1.8.1
patsy: 0.5.0
dateutil: 2.7.3
pytz: 2018.5
blosc: 1.6.1
bottleneck: 1.2.1
tables: 3.4.4
numexpr: 2.6.8
feather: None
matplotlib: 3.0.0
openpyxl: 2.5.9
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: None
lxml: 4.2.5
bs4: 4.6.3
html5lib: 1.0.1
sqlalchemy: 1.2.12
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: 0.1.6
pandas_gbq: None
pandas_datareader: None
pd.show_versions() -- Older Python 3.5.2 Pandas 0.22.0 : Fast Read CSV
commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 63 Stepping 2, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None
pandas: 0.22.0
pytest: 3.5.0
pip: 9.0.3
setuptools: 20.10.1
Cython: 0.28.1
numpy: 1.14.2
scipy: 1.0.1
pyarrow: 0.9.0
xarray: 0.10.2
IPython: 6.3.0
sphinx: 1.7.2
patsy: 0.5.0
dateutil: 2.7.2
pytz: 2018.3
blosc: None
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.4
feather: 0.4.0
matplotlib: 2.2.2
openpyxl: 2.5.1
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: None
lxml: 4.2.1
bs4: 4.6.0
html5lib: 0.9999999
sqlalchemy: 1.2.6
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: 0.1.5
pandas_gbq: None
pandas_datareader: None
Can you benchmark just the change in python, and just the change in pandas separately?
> Can you benchmark just the change in python, and just the change in pandas separately?
Yes, are there older builds of Pandas on Python 3.7.1? I suppose I can try newer pandas version on old Python.
I think 0.23.2 is the first version of pandas to support 3.7
I ran the test on an older Python 3.5 stack with the latest Pandas version 0.23.4 but with a lot of older versions of other modules, and it looks to be running faster on Python 3.5. Now, I'm not quite sure if it's pandas directly on Python 3.7.1 or one of its dependencies.
Does the parser's `_read()` method rely on some other library that may be the culprit?
%prun df2 = pd.read_csv(csv)
5154 function calls (5041 primitive calls) in 2.004 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
1 1.960 1.960 1.962 1.962 {method 'read' of 'pandas._libs.parsers.TextReader' objects}
1 0.030 0.030 0.030 0.030 internals.py:5017(_stack_arrays)
1 0.006 0.006 2.004 2.004 parsers.py:414(_read)
1 0.003 0.003 0.003 0.003 parsers.py:1685(__init__)
321 0.001 0.000 0.001 0.000 common.py:811(is_integer_dtype)
518 0.000 0.000 0.000 0.000 common.py:1835(_get_dtype_type)
Installed Python 3.5 Stack With Latest pandas 0.23.4:
commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 158 Stepping 10, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None
pandas: 0.23.4
pytest: 3.5.0
pip: 18.0
setuptools: 38.6.0
Cython: 0.28.1
numpy: 1.14.2
scipy: 1.0.1
pyarrow: 0.10.0
xarray: 0.10.2
IPython: 6.3.0
sphinx: 1.7.2
patsy: 0.5.0
dateutil: 2.7.2
pytz: 2018.3
blosc: 1.5.1
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.4
feather: 0.4.0
matplotlib: 2.2.2
openpyxl: 2.5.1
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: None
lxml: 4.2.1
bs4: 4.6.0
html5lib: 0.9999999
sqlalchemy: 1.2.6
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: 0.1.5
pandas_gbq: None
pandas_datareader: None
Interestingly, if I specify `float_precision='round_trip'` I get similar parsing speeds. If I specify `'high'` or None then it's back to the same 3.5x difference.
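For context -- and this is my reading of the 0.23.x C parser sources, so treat the mapping as an assumption -- `float_precision=None` uses the fast `xstrtod`, `'high'` uses `precise_xstrtod`, and `'round_trip'` calls back into Python's own float parser. A quick way to compare all three outside IPython:

```python
import io
import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(1000000, 10),
                  columns=('COL{}'.format(i) for i in range(10)))
data = df.to_csv(index=False)

for fp in (None, 'high', 'round_trip'):
    # Rebuild the buffer each run, since read_csv consumes it.
    t = timeit.timeit(
        lambda: pd.read_csv(io.StringIO(data), float_precision=fp), number=3)
    print(fp, round(t / 3, 3))
```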
__Python 3.7.1__
%prun df2 = pd.read_csv(csv, float_precision='round_trip')
4320 function calls (4286 primitive calls) in 4.074 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
1 3.862 3.862 3.864 3.864 {method 'read' of 'pandas._libs.parsers.TextReader' objects}
1 0.168 0.168 4.074 4.074 <string>:1(<module>)
1 0.030 0.030 0.030 0.030 internals.py:5017(_stack_arrays)
1 0.006 0.006 3.906 3.906 parsers.py:414(_read)
1 0.003 0.003 0.003 0.003 parsers.py:1685(__init__)
321 0.001 0.000 0.002 0.000 common.py:811(is_integer_dtype)
516 0.000 0.000 0.001 0.000 common.py:1835(_get_dtype_type)
952 0.000 0.000 0.000 0.000 {built-in method builtins.isinstance}
__Python 3.5.2__
%prun df2 = pd.read_csv(csv, float_precision='round_trip')
4582 function calls (4545 primitive calls) in 3.716 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
1 3.665 3.665 3.667 3.667 {method 'read' of 'pandas._libs.parsers.TextReader' objects}
1 0.031 0.031 0.031 0.031 internals.py:5017(_stack_arrays)
1 0.006 0.006 3.710 3.710 parsers.py:414(_read)
1 0.006 0.006 3.716 3.716 <string>:1(<module>)
1 0.003 0.003 0.003 0.003 parsers.py:1685(__init__)
321 0.001 0.000 0.001 0.000 common.py:811(is_integer_dtype)
518 0.000 0.000 0.001 0.000 common.py:1835(_get_dtype_type)
Adding another data point: if I specify the 'python' engine, it looks like on Python 3.7.1 `pandas._libs.lib.maybe_convert_numeric` is 3X slower than on Python 3.5.2.
Could this be due to the Cython version?
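To isolate that call from the rest of the python-engine machinery, something along these lines should work (a sketch; `maybe_convert_numeric` is an internal function and its signature may differ across pandas versions):

```python
import timeit

import numpy as np
from pandas._libs import lib

# One million float strings, roughly what the python engine hands over.
values = np.array(['{:.17g}'.format(x) for x in np.random.randn(1000000)],
                  dtype=object)

# Time just the string-to-number conversion step (empty set of NA values).
t = timeit.timeit(lambda: lib.maybe_convert_numeric(values, set()), number=5)
print(round(t / 5, 3))
```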
__Python 3.7.1__
%prun df2 = pd.read_csv(csv, engine='python')
7003613 function calls (7003575 primitive calls) in 14.411 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
10 9.698 0.970 9.698 0.970 {pandas._libs.lib.maybe_convert_numeric}
1000004 3.221 0.000 3.221 0.000 {built-in method builtins.next}
1 0.386 0.386 4.066 4.066 parsers.py:2926(_get_lines)
1 0.263 0.263 14.399 14.399 parsers.py:1029(read)
4 0.154 0.038 0.247 0.062 parsers.py:2738(_remove_empty_lines)
1000002 0.138 0.000 3.359 0.000 parsers.py:2681(_next_iter_line)
2000072 0.125 0.000 0.125 0.000 {method 'append' of 'list' objects}
1 0.116 0.116 0.116 0.116 {pandas._libs.lib.to_object_array}
1000001 0.103 0.000 0.138 0.000 parsers.py:2869(<genexpr>)
2000130/2000117 0.069 0.000 0.069 0.000 {built-in method builtins.len}
14 0.067 0.005 0.204 0.015 {built-in method builtins.max}
__Python 3.5.2__
%prun df2 = pd.read_csv(csv, engine='python')
7004040 function calls (7004000 primitive calls) in 8.411 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
1000003 3.662 0.000 3.662 0.000 {built-in method builtins.next}
10 3.270 0.327 3.270 0.327 {pandas._libs.lib.maybe_convert_numeric}
1 0.378 0.378 4.492 4.492 parsers.py:2926(_get_lines)
1 0.263 0.263 8.398 8.398 parsers.py:1029(read)
4 0.167 0.042 0.245 0.061 parsers.py:2738(_remove_empty_lines)
1000002 0.141 0.000 3.803 0.000 parsers.py:2681(_next_iter_line)
1 0.128 0.128 0.128 0.128 {pandas._libs.lib.to_object_array}
2000067 0.109 0.000 0.109 0.000 {method 'append' of 'list' objects}
1000001 0.108 0.000 0.133 0.000 parsers.py:2869(<genexpr>)
14 0.060 0.004 0.193 0.014 {built-in method builtins.max}
These last numbers are with what pandas version?
> These last numbers are with what pandas version?
They are both Pandas 0.23.4
I tried building the latest Pandas version from source on Python 3.7.1 and still got the same slower performance. Are there any build/compile/cython flags I can set to optimize the parser?
the entire perf issue is simply the precision flag
you can choose higher precision but it takes more time; this is rarely useful though
> the entire perf issue is simply the precision flag
> you can choose higher precision but it takes more time; this is rarely useful though
I tried all three different `float_precision=` flags, and for 'high' and None the 3.5x slowdown was still present on Python 3.7.1 vs Python 3.5.2.
I also tried specifying a `float_format=` in `pd.to_csv()` and I still see the same consistent 3.5x gap.
Can you reproduce in Python 3.6?
I should reiterate this perf difference is on the same version of Pandas 0.23.4, just different versions of Python.
is there a way to specify 'xstrtod', or is that specified by float_precision=None?
I see no performance changes between 'high' and None.
Is anyone able to reproduce this on Python 3.7.1? I tested the code above on Python 3.7.0 using the Python.org interactive interpreter and it seemed to run faster than in my local 3.7.1 install.
Something is definitely up. I did a side-by-side comparison reading the same CSV file on disk. Python 3.5 reads at 111 MB/sec and Python 3.7 reads at only 28 MB/sec from the same SSD, both running Pandas 0.23.4.
Could Python 3.7 have changed something in their I/O system?
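One way to separate raw file I/O from parsing is to time a plain read of the same file with no pandas involved (a sketch; `out.csv` is the file from the benchmarks above):

```python
import os
import time

path = 'out.csv'  # the benchmark file written earlier
size_mb = os.path.getsize(path) / 1e6

# Raw binary read: no decoding, no parsing.
start = time.perf_counter()
with open(path, 'rb') as f:
    f.read()
print('binary: {:.0f} MB/s'.format(size_mb / (time.perf_counter() - start)))

# Text-mode read: adds decoding, still no parsing.
start = time.perf_counter()
with open(path, 'r') as f:
    f.read()
print('text:   {:.0f} MB/s'.format(size_mb / (time.perf_counter() - start)))
```

If both interpreters show similar throughput here, the gap is in the parser rather than in the I/O layer.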
__Python 3.5.2 & Pandas 0.23.4__
In [38]: %timeit pd.read_csv(r'out.csv', float_precision='high')
1.86 s ± 13.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
__Python 3.7.1 & Pandas 0.23.4__
In [17]: %timeit pd.read_csv(r'out.csv', float_precision='high')
7.97 s ± 19 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
I don't see the difference you're seeing
Python 3.5.6 |Anaconda, Inc.| (default, Aug 26 2018, 16:30:03)
Type 'copyright', 'credits' or 'license' for more information
IPython 6.5.0 -- An enhanced Interactive Python. Type '?' for help.
In [1]: import pandas as pd
In [2]: pd.__version__
Out[2]: '0.23.4'
In [3]: %time _ = pd.read_csv('out.csv', float_precision='high')
CPU times: user 2.59 s, sys: 214 ms, total: 2.81 s
Wall time: 2.73 s
3.7
Python 3.7.1 (default, Oct 23 2018, 14:07:42)
Type 'copyright', 'credits' or 'license' for more information
IPython 7.1.1 -- An enhanced Interactive Python. Type '?' for help.
In [1]: import pandas as pd
In [2]: pd.__version__
Out[2]: '0.23.4'
In [3]: %time _ = pd.read_csv('out.csv', float_precision='high')
CPU times: user 2.61 s, sys: 211 ms, total: 2.82 s
Wall time: 2.74 s
Both of those are using Anaconda's packages.
Tom, thanks for running this benchmark. Can you post your pd.show_versions()? I want to re-create your stack exactly to do some more testing.
3.5
INSTALLED VERSIONS
------------------
commit: None
python: 3.5.6.final.0
python-bits: 64
OS: Darwin
OS-release: 17.7.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
pandas: 0.23.4
pytest: None
pip: 10.0.1
setuptools: 40.2.0
Cython: None
numpy: 1.15.2
scipy: None
pyarrow: None
xarray: None
IPython: 6.5.0
sphinx: None
patsy: None
dateutil: 2.7.3
pytz: 2018.5
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
3.7:
INSTALLED VERSIONS
------------------
commit: None
python: 3.7.1.final.0
python-bits: 64
OS: Darwin
OS-release: 17.7.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
pandas: 0.23.4
pytest: None
pip: 18.1
setuptools: 40.5.0
Cython: None
numpy: 1.15.4
scipy: None
pyarrow: None
xarray: None
IPython: 7.1.1
sphinx: None
patsy: None
dateutil: 2.7.5
pytz: 2018.7
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
I tried several different fresh Python installs on Windows. Every Python 3.7 install, 32- or 64-bit, with Pandas 0.23.4 pip-installed results in the slower CSV parsing speed. For fun I tried a fresh Python 3.6.7 install, and it again parses the same CSV 3X faster.
Is there anyone who could test this on Windows 10 and Python 3.7.1? 😕
cc @chris-b1 in case you can test on Windows
Indeed, I can confirm that there is a 3.5X slowdown when using Python 3.7.1 on Windows 10.
When I use Python 3.5.6, the performance is unchanged from `0.22.0` to `0.23.4`.
These observations are consistent with what @dragoljub was observing and appear to suggest that this is a Cython / Python issue and not a `pandas` one.
On Windows 10 I note a noticeable slowdown as well, on both Python 3.6 and Python 3.7.
python 3.6
(py36) PS C:\Users\ttttt> ipython
Python 3.6.4 | packaged by conda-forge | (default, Dec 24 2017, 10:11:43) [MSC v.1900 64 bit (AMD64)]
Type 'copyright', 'credits' or 'license' for more information
IPython 7.1.1 -- An enhanced Interactive Python. Type '?' for help.
In [1]: import pandas as pd
In [2]: %time _ = pd.read_csv('out.csv', float_precision='high')
Wall time: 7.03 s
In [3]: %time _ = pd.read_csv('out.csv')
Wall time: 7.04 s
python 3.7
```python
(py37) PS C:\Users\ttttt> ipython
Python 3.7.1 (default, Oct 28 2018, 08:39:03) [MSC v.1912 64 bit (AMD64)]
Type 'copyright', 'credits' or 'license' for more information
IPython 7.1.1 -- An enhanced Interactive Python. Type '?' for help.
In [1]: import pandas as pd
In [2]: df = pd.DataFrame(np.random.randn(1000000, 10), columns=('COL{}'.format(i) for i in range(10)))
In [6]: df.to_csv('out.csv')
In [7]: %time _ = pd.read_csv('out.csv', float_precision='high')
Wall time: 29.4 s
In [8]: %time _ = pd.read_csv('out.csv')
Wall time: 31.3 s
```
For people on windows, how are you installing pandas? From source, wheels, or conda packages? And if conda, from defaults or from conda-forge?
Here conda-forge:
PS C:\Users\ttttt> activate py37
(py37) PS C:\Users\ttttt> conda install ipython pandas
Solving environment: done
## Package Plan ##
environment location: C:\Miniconda\envs\py37
added / updated specs:
- ipython
- pandas
The following packages will be downloaded:
package | build
---------------------------|-----------------
ipython-7.1.1 |py37h39e3cac_1000 1.1 MB conda-forge
wcwidth-0.1.7 | py_1 17 KB conda-forge
six-1.11.0 | py37_1001 21 KB conda-forge
pytz-2018.7 | py_0 226 KB conda-forge
icc_rt-2017.0.4 | h97af966_0 8.0 MB
pygments-2.2.0 | py_1 622 KB conda-forge
pickleshare-0.7.5 | py37_1000 12 KB conda-forge
certifi-2018.10.15 | py37_1000 137 KB conda-forge
backcall-0.1.0 | py_0 13 KB conda-forge
mkl_random-1.0.1 | py37h77b88f5_1 267 KB
decorator-4.3.0 | py_0 10 KB conda-forge
numpy-1.15.4 | py37ha559c80_0 36 KB
mkl-2019.0 | 118 178.1 MB
pandas-0.23.4 |py37h830ac7b_1000 8.7 MB conda-forge
prompt_toolkit-2.0.7 | py_0 218 KB conda-forge
python-dateutil-2.7.5 | py_0 218 KB conda-forge
colorama-0.4.0 | py_0 15 KB conda-forge
mkl_fft-1.0.6 | py37hdbbee80_0 120 KB
jedi-0.13.1 | py37_1000 228 KB conda-forge
intel-openmp-2019.0 | 118 1.7 MB
parso-0.3.1 | py_0 59 KB conda-forge
traitlets-4.3.2 | py37_1000 130 KB conda-forge
ipython_genutils-0.2.0 | py_1 21 KB conda-forge
numpy-base-1.15.4 | py37h8128ebf_0 3.9 MB
blas-1.0 | mkl 6 KB
------------------------------------------------------------
Total: 203.7 MB
Thanks @toniatop. Can you create a couple environments with just defaults to see if it's an issue with how it was compiled for conda-forge?
I redid everything forcing --channel anaconda; same results.
cc @jjhelmus any thoughts on https://github.com/pandas-dev/pandas/issues/23516#issuecomment-436958298? The tl;dr is that `pd.read_csv` is 3-4x slower on python 3.7 vs. python 3.6.

I have also tested this with the pre-built wheels from https://www.lfd.uci.edu/~gohlke/pythonlibs/ with the same results.
I also ran a source build of the latest GitHub code with `python setup.py bdist_wheel` on Python 3.7.1 and got the same results that way.
I wonder if something in the build script has changed or some compile flag on windows.
> I also ran a source build of the latest GitHub code with python setup.py bdist_wheel on python 3.7.1 and got the same results that way.
Did you also try building from source with 3.6?
I will try building with python 3.6.7 today. I will also use the latest version of cython in case that’s the culprit.
Thanks for all those who confirmed this on windows.
I just rebuilt the latest Pandas version (0.23.4+) from source with the latest Cython 0.29 on Python 3.6.7 on Windows 10, and the parsing speed is _fast_. So it seems somehow related to Python 3.7 on Windows. Not sure what it could be. Does the Python IO system pass data to the Cython/C parser? Was the Python 3.7 version compiled without optimizations?
Microsoft Windows [Version 10.0.16299.726]
(c) 2017 Microsoft Corporation. All rights reserved.
Python 3.6.7 (v3.6.7:6ec5cf24b7, Oct 20 2018, 13:35:33) [MSC v.1900 64 bit (AMD64)]
Type 'copyright', 'credits' or 'license' for more information
IPython 7.1.1 -- An enhanced Interactive Python. Type '?' for help.
In [1]: import pandas as pd
In [2]: pd.show_versions()
INSTALLED VERSIONS
------------------
commit: None
python: 3.6.7.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 158 Stepping 10, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None
pandas: 0+unknown
pytest: None
pip: 10.0.1
setuptools: 39.0.1
Cython: 0.29
numpy: 1.15.4
scipy: None
pyarrow: None
xarray: None
IPython: 7.1.1
sphinx: None
patsy: None
dateutil: 2.7.5
pytz: 2018.7
blosc: None
bottleneck: 1.2.1
tables: None
numexpr: 2.6.8
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None
In [3]: import io
In [4]: import numpy as np
In [5]: %time df = pd.DataFrame(np.random.randn(1000000, 10), columns=('COL{}'.format(i) for i in range(10)))
Wall time: 207 ms
In [6]: %time csv = io.StringIO(df.to_csv(index=False))
Wall time: 13.2 s
In [7]: %time df2 = pd.read_csv(csv, float_precision='high')
Wall time: 1.96 s
I posted an issue linking this one on the Python issue tracker so at least they can take a look. Since the performance seems good on Linux, I'm hopeful a compile/config fix could address it for Windows.
Just to cross a plausible suspect off the list - the MSVC version does not seem to matter.
# builds with MSVC 2017 with python 3.7 (assuming installed)
λ python setup.py build_ext -i
# %timeit pd.read_csv('tmp.csv')
# 14.6 s ± 701 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# builds with MSVC 2015
λ "C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\vcvarsall.bat" x64
λ python setup.py build_ext -i -f
# %timeit pd.read_csv('tmp.csv')
# 15.2 s ± 2.27 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
> Was the Python 3.7 version compiled without optimizations?
No, 3.7 extensions are continuing to be built with /Ox
Is the slowdown only in IO-related methods?
It seems the issue is actually the float parsing? Which is odd, because our `xstrtod` doesn't (?) interact with Python at all.
In fact, as @dragoljub noted, the `round_trip` parser is faster, which DOES call back into Python:
In [1]: %timeit pd.read_csv('tmp.csv')
15.2 s ± 2.27 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [2]: %timeit pd.read_csv('tmp.csv', float_precision='precise')
18.9 s ± 984 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [3]: %timeit pd.read_csv('tmp.csv', float_precision='round_trip')
8.67 s ± 205 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
I tested pure 0-9 INT parsing and it's also ~3.5X slower on Python 3.7.1 on Windows. So I have a feeling this issue is with the data coming in before we get to any parsing (float or otherwise).
I also see a suspicious line of code being called in Python 3.7 but not in 3.5.2:
{built-in method _thread.allocate_lock}
Could this be something new in Python 3.7.1 that is interrupting the parser? Do we need to release the GIL differently for Python 3.7?
__DataFrame Setup:__
import io
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(0, 9, (1000000, 10)), columns=('COL{}'.format(i) for i in range(10)))
csv = io.StringIO(df.to_csv(index=False))
```python
print(df.head().to_csv(index=False))
COL0,COL1,COL2,COL3,COL4,COL5,COL6,COL7,COL8,COL9
2,1,7,3,2,3,0,4,5,5
6,3,1,7,0,3,6,3,0,8
6,8,4,2,1,5,1,4,3,3
0,8,5,8,0,4,1,8,4,1
4,8,0,0,4,0,3,0,6,3
```
__Python 3.7.1 & Pandas 0.23.4 -- Slow__
```python
3676 function calls (3642 primitive calls) in 2.132 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
1 2.075 2.075 2.075 2.075 {method 'read' of 'pandas._libs.parsers.TextReader' objects}
1 0.039 0.039 0.039 0.039 internals.py:5017(_stack_arrays)
1 0.009 0.009 2.132 2.132 parsers.py:414(_read)
1 0.004 0.004 0.004 0.004 parsers.py:1685(__init__)
1 0.000 0.000 0.000 0.000 {method 'close' of 'pandas._libs.parsers.TextReader' objects}
161 0.000 0.000 0.000 0.000 common.py:811(is_integer_dtype)
206 0.000 0.000 0.000 0.000 common.py:1835(_get_dtype_type)
824 0.000 0.000 0.000 0.000 {built-in method builtins.isinstance}
2 0.000 0.000 0.000 0.000 {built-in method nt.stat}
8/2 0.000 0.000 0.001 0.000 <frozen importlib._bootstrap>:978(_find_and_load)
4/3 0.000 0.000 0.000 0.000 base.py:255(__new__)
76 0.000 0.000 0.000 0.000 base.py:61(is_dtype)
2 0.000 0.000 0.001 0.000 {pandas._libs.lib.clean_index_list}
1 0.000 0.000 2.132 2.132 {built-in method builtins.exec}
462 0.000 0.000 0.000 0.000 {built-in method builtins.issubclass}
41 0.000 0.000 0.000 0.000 {built-in method numpy.core.multiarray.array}
89/78 0.000 0.000 0.000 0.000 {built-in method builtins.len}
86 0.000 0.000 0.000 0.000 {built-in method builtins.hasattr}
149 0.000 0.000 0.000 0.000 generic.py:7(_check)
1 0.000 0.000 0.040 0.040 internals.py:4880(form_blocks)
3/2 0.000 0.000 0.001 0.001 series.py:166(__init__)
13 0.000 0.000 0.000 0.000 internals.py:3148(get_block_type)
245 0.000 0.000 0.000 0.000 {built-in method builtins.getattr}
7 0.000 0.000 0.000 0.000 {built-in method numpy.core.multiarray.empty}
12 0.000 0.000 0.000 0.000 cast.py:971(maybe_cast_to_datetime)
3 0.000 0.000 0.000 0.000 internals.py:237(mgr_locs)
1 0.000 0.000 0.000 0.000 internals.py:3363(_rebuild_blknos_and_blklocs)
8 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap>:157(_get_module_lock)
2 0.000 0.000 0.000 0.000 {built-in method pandas._libs.lib.array_equivalent_object}
2 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap>:882(_find_spec)
12 0.000 0.000 0.000 0.000 series.py:4019(_sanitize_array)
2 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap_external>:1356(find_spec)
16 0.000 0.000 0.000 0.000 {built-in method _thread.allocate_lock}
1 0.000 0.000 2.075 2.075 parsers.py:1846(read)
2 0.000 0.000 0.000 0.000 missing.py:189(_isna_ndarraylike)
2 0.000 0.000 0.000 0.000 {method 'reduce' of 'numpy.ufunc' objects}
1 0.000 0.000 2.117 2.117 parsers.py:1029(read)
1 0.000 0.000 2.132 2.132 parsers.py:542(parser_f)
```
__Python 3.5.2 & Pandas 0.23.4 -- Fast__
%prun df2 = pd.read_csv(csv, float_precision=None)
2623 function calls (2616 primitive calls) in 0.661 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.604 0.604 0.605 0.605 {method 'read' of 'pandas._libs.parsers.TextReader' objects}
1 0.039 0.039 0.039 0.039 internals.py:4801(_stack_arrays)
1 0.011 0.011 0.661 0.661 parsers.py:423(_read)
1 0.004 0.004 0.004 0.004 parsers.py:1677(__init__)
1 0.000 0.000 0.000 0.000 {method 'close' of 'pandas._libs.parsers.TextReader' objects}
3/2 0.000 0.000 0.001 0.000 base.py:181(__new__)
160 0.000 0.000 0.000 0.000 common.py:777(is_integer_dtype)
186 0.000 0.000 0.000 0.000 common.py:1773(_get_dtype_type)
622 0.000 0.000 0.000 0.000 {built-in method builtins.isinstance}
108 0.000 0.000 0.000 0.000 {built-in method builtins.hasattr}
1 0.000 0.000 0.605 0.605 parsers.py:1837(read)
1 0.000 0.000 0.000 0.000 {pandas._libs.lib.clean_index_list}
129 0.000 0.000 0.000 0.000 {built-in method builtins.getattr}
118 0.000 0.000 0.000 0.000 generic.py:7(_check)
1 0.000 0.000 0.039 0.039 internals.py:4645(form_blocks)
10 0.000 0.000 0.000 0.000 cast.py:935(maybe_cast_to_datetime)
61 0.000 0.000 0.000 0.000 dtypes.py:85(is_dtype)
415 0.000 0.000 0.000 0.000 {built-in method builtins.issubclass}
1 0.000 0.000 0.661 0.661 {built-in method builtins.exec}
1 0.000 0.000 0.661 0.661 parsers.py:557(parser_f)
I also found this bug fix from a while ago that made some changes to `pandas/src/parser/io.c` where PyGILState_Ensure() was called, possibly interacting differently on Windows systems with threads on Python 3.7.
https://github.com/pandas-dev/pandas/pull/11790/files#diff-006bfcefd42c2be38e538fdd3219dbfdL125
I'd be surprised that change matters, but I'm at a loss here, so maybe! Another possibility is that cython made some tweaks to threading logic for python 3.7 compat - again, wouldn't think that's the issue here, but possible some kind of bad interaction.
https://github.com/cython/cython/issues/1978
> I'd be surprised that change matters, but I'm at a loss here, so maybe! Another possibility is that cython made some tweaks to threading logic for python 3.7 compat - again, wouldn't think that's the issue here, but possible some kind of bad interaction.
> cython/cython#1978
Good info. I'm just surprised that people do not see this on Linux. I'll try OSX next.
FYI my timings above were on OSX (no slowdown)
Looks like the slowdown first shows up in Python `3.7.0a4`:
>C:\python-3.7.0a3-amd64\python.exe -m cProfile -s tottime pandascsv.py
235992 function calls (229477 primitive calls) in 21.525 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
200 11.316 0.057 11.316 0.057 {method 'astype' of 'numpy.ndarray' objects}
100 6.596 0.066 6.596 0.066 {pandas._libs.writers.write_csv_rows}
1 2.111 2.111 2.112 2.112 {method 'read' of 'pandas._libs.parsers.TextReader' objects}
>C:\python-3.7.0a4-amd64\python.exe -m cProfile -s tottime pandascsv.py
236639 function calls (230127 primitive calls) in 26.550 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
1 9.849 9.849 9.850 9.850 {method 'read' of 'pandas._libs.parsers.TextReader' objects}
200 8.766 0.044 8.766 0.044 {method 'astype' of 'numpy.ndarray' objects}
100 6.469 0.065 6.469 0.065 {pandas._libs.writers.write_csv_rows}
Very interesting. I'll try Py 3.7.0a3 to confirm this on my systems. Is the diff between `0a3` and `0a4` easy to find from Python's release notes?
> Is the diff easy to find from Python's release notes?
https://docs.python.org/3.7/whatsnew/changelog.html#python-3-7-0-alpha-4
Maybe bpo-29240: Add a new UTF-8 mode: implementation of the PEP 540?
See also https://github.com/python/cpython/compare/v3.7.0a3...v3.7.0a4
I can also confirm the changes from Python 3.7.0a3 to 3.7.0a4 show the slowdown on my Win10 test system. Thanks for finding when the slowdown occurred.
__Python 3.7.0a3 -- Fast Parse__
%prun df2 = pd.read_csv(csv)
5781 function calls (5743 primitive calls) in 3.062 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
1 2.953 2.953 2.955 2.955 {method 'read' of 'pandas._libs.parsers.TextReader' objects}
1 0.063 0.063 0.063 0.063 internals.py:5017(_stack_arrays)
1 0.016 0.016 3.052 3.052 parsers.py:414(_read)
1 0.009 0.009 3.062 3.062 <string>:1(<module>)
1 0.009 0.009 0.009 0.009 parsers.py:1685(__init__)
32 0.004 0.000 0.004 0.000 {built-in method nt.stat}
1 0.001 0.001 0.001 0.001 {method 'close' of 'pandas._libs.parsers.TextReader' objects}
321 0.001 0.000 0.002 0.000 common.py:811(is_integer_dtype)
516 0.001 0.000 0.001 0.000 common.py:1835(_get_dtype_type)
7 0.001 0.000 0.001 0.000 {built-in method numpy.core.multiarray.empty}
32 0.000 0.000 0.005 0.000 <frozen importlib._bootstrap_external>:1235(find_spec)
988 0.000 0.000 0.000 0.000 {built-in method builtins.isinstance}
163 0.000 0.000 0.000 0.000 common.py:1527(is_float_dtype)
718 0.000 0.000 0.000 0.000 {built-in method builtins.issubclass}
192 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap_external>:59(<listcomp>)
192 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap_external>:57(_path_join)
8 0.000 0.000 0.005 0.001 <frozen importlib._bootstrap_external>:1119(_get_spec)
133 0.000 0.000 0.000 0.000 {built-in method builtins.getattr}
68 0.000 0.000 0.000 0.000 generic.py:7(_check)
__Python 3.7.0a4 -- Slow Parse__
%prun df2 = pd.read_csv(csv)
8007 function calls (7219 primitive calls) in 14.192 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
1 14.092 14.092 14.094 14.094 {method 'read' of 'pandas._libs.parsers.TextReader' objects}
1 0.061 0.061 0.062 0.062 internals.py:5017(_stack_arrays)
1 0.016 0.016 14.192 14.192 parsers.py:414(_read)
1 0.008 0.008 0.008 0.008 parsers.py:1685(__init__)
32 0.004 0.000 0.004 0.000 {built-in method nt.stat}
1 0.001 0.001 0.001 0.001 {method 'close' of 'pandas._libs.parsers.TextReader' objects}
321 0.001 0.000 0.002 0.000 common.py:811(is_integer_dtype)
516 0.001 0.000 0.001 0.000 common.py:1835(_get_dtype_type)
7 0.001 0.000 0.001 0.000 {built-in method numpy.core.multiarray.empty}
115/4 0.000 0.000 0.001 0.000 abc.py:194(__subclasscheck__)
32 0.000 0.000 0.005 0.000 <frozen importlib._bootstrap_external>:1322(find_spec)
1324/988 0.000 0.000 0.002 0.000 {built-in method builtins.isinstance}
937/725 0.000 0.000 0.002 0.000 {built-in method builtins.issubclass}
163 0.000 0.000 0.000 0.000 common.py:1527(is_float_dtype)
192 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap_external>:57(_path_join)
192 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap_external>:59(<listcomp>)
8 0.000 0.000 0.005 0.001 <frozen importlib._bootstrap_external>:1206(_get_spec)
89/78 0.000 0.000 0.000 0.000 {built-in method builtins.len}
192 0.000 0.000 0.000 0.000 {built-in method builtins.getattr}
I tried playing around with the UTF-8 mode settings with ENV variables and cmd line args on Windows and was not able to get faster parsing speed on Python 3.7.0a4.
https://www.python.org/dev/peps/pep-0540/#proposal
> The benefit of the locale coercion approach is that it helps ensure that encoding handling in binary extension modules and child processes is consistent with Python's encoding handling. The upside of the UTF-8 Mode approach is that it allows an embedding application to change the interpreter's behaviour without having to change the process global locale settings.
So is it possible that somewhere in the C parser extension we could just set the locale to UTF-8 so this issue would go away on Windows? I was hoping the ENV variable settings would fix the issue, but they did not make a difference in my testing.
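For reference, UTF-8 mode (PEP 540) is decided at interpreter startup -- `PYTHONUTF8=1` in the environment or `python -X utf8` on the command line -- and whether it actually took effect can be checked in-process via a flag that exists as of 3.7:

```python
import sys

# 1 if UTF-8 mode is active (python -X utf8 or PYTHONUTF8=1), else 0.
print(sys.flags.utf8_mode)
```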
I compared the statement `df2 = pd.read_csv(csv)` on Python 3.7.0a3 and a4 in the Visual Studio profiler. The culprit is the `isdigit` function called in the `parsers` extension module. On `3.7.0a3` the function is fast at ~8% of samples. On `3.7.0a4` the function is slow at ~64% of samples because it calls the `_isdigit_l` function, which seems to update and restore the locale in the current thread every time...
3.7.0a3:
Function Name Inclusive Samples Exclusive Samples Inclusive Samples % Exclusive Samples % Module Name
+ [parsers.cp37-win_amd64.pyd] 705 347 28.52% 14.04% parsers.cp37-win_amd64.pyd
isdigit 207 207 8.37% 8.37% ucrtbase.dll
- _errno 105 39 4.25% 1.58% ucrtbase.dll
toupper 24 24 0.97% 0.97% ucrtbase.dll
isspace 21 21 0.85% 0.85% ucrtbase.dll
[python37.dll] 1 1 0.04% 0.04% python37.dll
3.7.0a4:
Function Name Inclusive Samples Exclusive Samples Inclusive Samples % Exclusive Samples % Module Name
+ [parsers.cp37-win_amd64.pyd] 8,613 478 83.04% 4.61% parsers.cp37-win_amd64.pyd
+ isdigit 6,642 208 64.04% 2.01% ucrtbase.dll
+ _isdigit_l 6,434 245 62.03% 2.36% ucrtbase.dll
+ _LocaleUpdate::_LocaleUpdate 5,806 947 55.98% 9.13% ucrtbase.dll
+ __acrt_getptd 2,121 1,031 20.45% 9.94% ucrtbase.dll
FlsGetValue 647 647 6.24% 6.24% KernelBase.dll
- RtlSetLastWin32Error 296 235 2.85% 2.27% ntdll.dll
_guard_dispatch_icall_nop 101 101 0.97% 0.97% ucrtbase.dll
GetLastError 46 46 0.44% 0.44% KernelBase.dll
+ __acrt_update_multibyte_info 1,475 246 14.22% 2.37% ucrtbase.dll
- __crt_state_management::get_current_state_index 1,229 513 11.85% 4.95% ucrtbase.dll
+ __acrt_update_locale_info 1,263 235 12.18% 2.27% ucrtbase.dll
- __crt_state_management::get_current_state_index 1,028 429 9.91% 4.14% ucrtbase.dll
_ischartype_l 383 383 3.69% 3.69% ucrtbase.dll
Great work debugging this. I would guess any other code paths calling isdigit would also be slowed down on windows.
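One way to confirm the CRT regression independently of pandas would be to call ucrtbase's `isdigit` directly through ctypes (a sketch, Windows-only; the ctypes call overhead dominates, but a locale lookup inside `isdigit` should still widen the gap between interpreters):

```python
import ctypes
import timeit

# Load the Universal CRT and bind the C isdigit (Windows-only).
ucrt = ctypes.CDLL('ucrtbase')
isdigit = ucrt.isdigit
isdigit.argtypes = [ctypes.c_int]
isdigit.restype = ctypes.c_int

c = ord('7')
print(timeit.timeit(lambda: isdigit(c), number=1000000))
```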
Just a note for people looking at `xstrtod` (and thanks for doing so BTW, this looks like a really tough issue): there are two of them (#19361). Off the top of my head I'm not sure which is used in what context.
I may have found a pure Python example that seems to show a similar but smaller 2.5X slowdown. Also note the variability is 15X higher for the 3.7.1 run, possibly indicating that the locale argument is passed/used in some calls but not others.
Can someone test this on Linux and see if you see a difference?
__Python 3.7.1__
digits = ''.join([str(i) for i in range(10)]*10000000)
%timeit digits.isdigit() # --> 2.5X slower on python 3.7.1
537 ms ± 14.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
__Python 3.5.2__
digits = ''.join([str(i) for i in range(10)]*10000000)
%timeit digits.isdigit() # --> 2.5X slower on python 3.7.1
215 ms ± 986 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
--> Based on comments from https://bugs.python.org/msg329789 it appears this is a pure Unicode test, so it may be unrelated.
@cgohlke has posted a nice minimal example showing the slowdown: https://bugs.python.org/msg329790 Thanks! 👍
Thanks for the investigation @cgohlke - for 0.24 I suppose we should just shim in an ASCII `isdigit` function?
(MUSL, MIT licensed)
https://github.com/esmil/musl/blob/master/src/ctype/isdigit.c#L5
@chris-b1 I was thinking the same thing since it's quite a simple function; however, support for changed locales would then be limited. I wonder how the Windows isdigit function ends up calling the locale version. I don't think that source is available.
> I wonder how the Windows isdigit function ends up calling the locale version. I don't think that source is available.
The source code for the Windows UCRT is available with recent Windows SDKs. It is usually installed under `C:\Program Files (x86)\Windows Kits\10\Source`.
The `isdigit` and `_isdigit_l` functions are defined in `ucrt\convert\_ctype.cpp`:
extern "C" extern __inline int (__cdecl isdigit)(int const c)
{
return __acrt_locale_changed()
? (_isdigit_l)(c, nullptr)
: fast_check(c, _DIGIT);
}
extern "C" extern __inline int (__cdecl _isdigit_l)(int const c, _locale_t const locale)
{
_LocaleUpdate locale_update(locale);
return _isdigit_l(c, locale_update.GetLocaleT());
}
The following comment is from the `_wsetlocale` function:
// If no call has been made to setlocale to change locale from "C" locale
// to some other locale, we keep locale_changed = 0. Other functions that
// depend on locale use this variable to optimize performance for C locale
// which is normally the case in applications.
So if I'm understanding it correctly, even if we set the locale in Python to "C", the Windows isdigit function would still resort to calling the locale isdigit version, slowing down parsing, because the locale has 'changed'.
Is that the case in Python 3.7.0a3? Does setting the locale to "C" slow parsing down?
@jreback @TomAugspurger Do you think a simple shim of the isdigit function in the C parser code would be a fix we could entertain?
It would assume ASCII-compatible encoding for numeric columns, which I think should cover all/most CSV file encodings for digits.
int isdigit(int c)
{
return (unsigned)c-'0' < 10;
}
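As a quick sanity check, the unsigned-compare trick in that shim can be verified against the obvious range test by emulating 32-bit unsigned arithmetic in Python:

```python
def shim_isdigit(c):
    # Emulates (unsigned)c - '0' < 10: a negative difference wraps around
    # to a huge unsigned value, which then fails the < 10 test.
    return ((c - ord('0')) & 0xFFFFFFFF) < 10

assert all(shim_isdigit(c) == (ord('0') <= c <= ord('9')) for c in range(256))
print('shim matches ASCII 0-9 for all byte values')
```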
Yeah, if you want to submit a PR, ping me; if not, I'll try to get to it soon.
@chris-b1 Go for it! 😄
@chris-b1 @jreback Thanks for getting this done! Really appreciate it! 👍 😄