Pandas: "too many SQL variables" Error with pandas 0.23 - enable multivalues insert #19664 issue

Created on 17 May 2018 · 36 comments · Source: pandas-dev/pandas

#let's import packages to use
import numpy as np
import pandas as pd
from sqlalchemy import create_engine


# create a test dataframe with 20,000 rows
df = pd.DataFrame(np.arange(0, 20000, 1))
#create the engine to connect pandas with sqlite3
engine = create_engine('sqlite://')
#create connection
conn = engine.connect()
#convert df to sql table
df.to_sql('test', engine, if_exists='replace', chunksize=1000)
#print results
result = conn.execute("select * from test")
for row in result:
    print(row['index'])
conn.close()


Problem description

In pandas 0.22 I could write a dataframe of reasonable size to SQL without error. With 0.23 I now receive "OperationalError: (sqlite3.OperationalError) too many SQL variables" when converting a dataframe of ~20k rows to SQL. The problem appears to be the limit set by sqlite3, SQLITE_MAX_VARIABLE_NUMBER, which defaults to 999 according to the SQLite docs and can apparently only be changed by recompiling SQLite. I can confirm that a dataframe of 499 rows works, and that the test version above works with 20k rows when chunksize=499 is passed to df.to_sql. In my real case the limit is 76. These numbers clearly depend on the data size of each row, so a method is needed to estimate the chunk size from the data types and number of columns. Introduced by #19664.
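As a stopgap, here is a minimal workaround sketch (assuming SQLite's default SQLITE_MAX_VARIABLE_NUMBER of 999 and that the index is written as an extra column): derive a safe chunksize from the number of bind parameters per row before calling to_sql.

import numpy as np
import pandas as pd
from sqlalchemy import create_engine

SQLITE_MAX_VARIABLE_NUMBER = 999  # SQLite's default compile-time limit

df = pd.DataFrame(np.arange(0, 20000, 1))
engine = create_engine('sqlite://')

# each inserted row binds one parameter per column plus one for the index
params_per_row = len(df.columns) + 1
safe_chunksize = SQLITE_MAX_VARIABLE_NUMBER // params_per_row

df.to_sql('test', engine, if_exists='replace', chunksize=safe_chunksize)

For this single-column frame that gives chunksize=499, which matches the largest working value observed above.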

Expected Output

runfile('H:/Tests/Pandas_0.23_test.py', wdir='H:/Tests')
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19

Trace

runfile('H:/Tests/Pandas_0.23_test.py', wdir='H:/Tests')
Traceback (most recent call last):

File "", line 1, in
runfile('H:/Tests/Pandas_0.23_test.py', wdir='H:/Tests')

File "C:\Users\kane.hillAppData\Local\Continuum\anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py", line 705, in runfile
execfile(filename, namespace)

File "C:\Users\kane.hillAppData\Local\Continuum\anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py", line 102, in execfile
exec(compile(f.read(), filename, 'exec'), namespace)

File "H:/Tests/Pandas_0.23_test.py", line 19, in
df.to_sql('test',engine, if_exists='fail',chunksize=500)

File "C:\Users\kane.hillAppData\Local\Continuum\anaconda3\lib\site-packages\pandascore\generic.py", line 2127, in to_sql
dtype=dtype)

File "C:\Users\kane.hillAppData\Local\Continuum\anaconda3\lib\site-packages\pandas\io\sql.py", line 450, in to_sql
chunksize=chunksize, dtype=dtype)

File "C:\Users\kane.hillAppData\Local\Continuum\anaconda3\lib\site-packages\pandas\io\sql.py", line 1149, in to_sql
table.insert(chunksize)

File "C:\Users\kane.hillAppData\Local\Continuum\anaconda3\lib\site-packages\pandas\io\sql.py", line 663, in insert
self._execute_insert(conn, keys, chunk_iter)

File "C:\Users\kane.hillAppData\Local\Continuum\anaconda3\lib\site-packages\pandas\io\sql.py", line 638, in _execute_insert
conn.execute(*self.insert_statement(data, conn))

File "C:\Users\kane.hillAppData\Local\Continuum\anaconda3\lib\site-packages\sqlalchemy\engine\base.py", line 948, in execute
return meth(self, multiparams, params)

File "C:\Users\kane.hillAppData\Local\Continuum\anaconda3\lib\site-packages\sqlalchemy\sql\elements.py", line 269, in _execute_on_connection
return connection._execute_clauseelement(self, multiparams, params)

File "C:\Users\kane.hillAppData\Local\Continuum\anaconda3\lib\site-packages\sqlalchemy\engine\base.py", line 1060, in _execute_clauseelement
compiled_sql, distilled_params

File "C:\Users\kane.hillAppData\Local\Continuum\anaconda3\lib\site-packages\sqlalchemy\engine\base.py", line 1200, in _execute_context
context)

File "C:\Users\kane.hillAppData\Local\Continuum\anaconda3\lib\site-packages\sqlalchemy\engine\base.py", line 1413, in _handle_dbapi_exception
exc_info

File "C:\Users\kane.hillAppData\Local\Continuum\anaconda3\lib\site-packages\sqlalchemy\utilcompat.py", line 203, in raise_from_cause
reraise(type(exception), exception, tb=exc_tb, cause=cause)

File "C:\Users\kane.hillAppData\Local\Continuum\anaconda3\lib\site-packages\sqlalchemy\utilcompat.py", line 186, in reraise
raise value.with_traceback(tb)

File "C:\Users\kane.hillAppData\Local\Continuum\anaconda3\lib\site-packages\sqlalchemy\engine\base.py", line 1193, in _execute_context
context)

File "C:\Users\kane.hillAppData\Local\Continuum\anaconda3\lib\site-packages\sqlalchemy\engine\default.py", line 507, in do_execute
cursor.execute(statement, parameters)

OperationalError: (sqlite3.OperationalError) too many SQL variables [SQL: 'INSERT INTO test ("index", "0") VALUES (?, ?), (?, ?), (?, ?), (?, ?), (?, ?), (?, ?), (?, ?), (?, ?), (?, ?), (?, ?), (?, ?), (?, ?), (?,

Output of pd.show_versions()

pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.5.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 60 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: en
LOCALE: None.None

pandas: 0.23.0
pytest: 3.5.1
pip: 10.0.1
setuptools: 39.1.0
Cython: 0.28.2
numpy: 1.14.3
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 6.4.0
sphinx: 1.7.4
patsy: 0.5.0
dateutil: 2.7.3
pytz: 2018.4
blosc: None
bottleneck: 1.2.1
tables: 3.4.3
numexpr: 2.6.5
feather: None
matplotlib: 2.2.2
openpyxl: 2.5.3
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.0.4
lxml: 4.2.1
bs4: 4.6.0
html5lib: 1.0.1
sqlalchemy: 1.2.7
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

Labels: IO SQL, Regression

Most helpful comment

Had the initial issue using 0.23.0; updating to 0.23.4 corrected the problem. Thanks for the fix.

All 36 comments

This is also an issue when connecting to MSSQL. It fails silently and will not stop the function or return an exception, but will throw one if you set "chunksize" manually.

@kristang: I just played around a bit more. Have you tried a chunksize limit of 75 or less? My case is working for the local db based on this chunksize and strangely 499 for a sqlite db in memory, neither of which are the limits stated by sqlite. Have you tried different chunksizes in your case?

I tried 999, 1000 and 2100, but then just ended up downgrading to 0.22 to get on with my day.

@tripkane could I bother you to narrow down the original example a bit? It should be minimal, and something that's appropriate for sticking in a unit test.

Could you also include the traceback you get?

@kristang: :) fair enough, maybe try these lower numbers out when you have time

@TomAugspurger: Hi Tom. Yep, can do. So should I reduce it to my actual case (i.e. a locally saved database, not in memory) and create the data from np.arange like the :memory: version?

In-memory is fine if it reproduces the error. http://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports may be helpful.

@TomAugspurger: I have shortened it and updated the comments. Basically a chunksize must be used, and it needs to be estimated from the data size of a single row for each individual case. This person implemented such a solution with peewee: https://stackoverflow.com/questions/35616602/peewee-operationalerror-too-many-sql-variables-on-upsert-of-only-150-rows-8-c Anyway, I hope this helps; in the meantime, those hitting this issue can't go wrong with something very small like chunksize=1 :)

@TomAugspurger: After further testing, the initial assumption seems correct: the error is caused by the SQLITE_MAX_VARIABLE_NUMBER limit of 999. The maximum allowable chunksize for df.to_sql is given by chunksize=999//(cols+1), where cols was generated for the test by cols = np.random.randint(1,20) and used to create df = pd.DataFrame(np.random.random((20000, cols))).
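A sketch of that test, assuming pandas 0.23.0 and an in-memory SQLite database:

import numpy as np
import pandas as pd
from sqlalchemy import create_engine

cols = np.random.randint(1, 20)
df = pd.DataFrame(np.random.random((20000, cols)))
engine = create_engine('sqlite://')

# each row of the multi-values INSERT binds (cols + 1) parameters
# (the +1 is the written index), so cap the chunk size accordingly
chunksize = 999 // (cols + 1)
df.to_sql('test', engine, if_exists='replace', chunksize=chunksize)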

@tripkane thanks for the minimal example! I can confirm this regression.

@jorisvandenbossche: You're welcome, happy to help.

Since you guys are looking into this... Can you guys check if the change that caused this regression actually improved performance?

As I reported on #8953 this change actually made the operation slower for me. And IMO should be reverted...

Best to open a separate issue to track that. Could you do some profiling as well, to see where the slowdown is?

@TomAugspurger: I tested this quickly with essentially the same example as above for two dataframe lengths, 20k and 200k rows, comparing chunksize=1 with chunksize=999//(cols+1) for cols = 10. In both cases the larger chunksize was twice as fast when connected to a local sqlite db.
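A rough timing sketch along those lines (assumed setup: in-memory SQLite, 10 columns, 20k rows; absolute numbers will differ per machine and pandas version):

import time

import numpy as np
import pandas as pd
from sqlalchemy import create_engine

cols = 10
df = pd.DataFrame(np.random.random((20000, cols)))
engine = create_engine('sqlite://')

for chunksize in (1, 999 // (cols + 1)):
    start = time.perf_counter()
    df.to_sql('timing_test', engine, if_exists='replace', chunksize=chunksize)
    print('chunksize=%d: %.2f s' % (chunksize, time.perf_counter() - start))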

@TomAugspurger I created issue #21146 to track this performance regression.
I did not profile it but did some experiments... my conclusion is that performance on 0.23 is better for a small number of columns but degrades quickly as the number of columns grows.

@pandas-dev/pandas-core How do people feel about reverting https://github.com/pandas-dev/pandas/pull/19664? It both breaks existing code and degrades performance significantly. Happy to put up the patch.

Reverting seems reasonable for now.

Is there anything we can salvage from #19664 so that users can opt into multi-row inserts if they want?

cc @danfrankj

In the original issue (https://github.com/pandas-dev/pandas/issues/8953), having an optional keyword for this was discussed, since which default is best depends a lot on the exact circumstances (this discussion was largely ignored in the PR, sorry about not raising it there).
So I think making this an option, with the default being the previous behaviour, is certainly a possible way forward.

I think it would still be useful for someone with some time to go through the discussion in that issue to check the raised use cases, how it interacts with chunksize, etc., as I don't recall everything that was discussed there.

IMO pandas should allow more flexibility here: users pass in a function that will get called with the SQLAlchemy table and they can construct the DML themselves. The default DML construction function would be the previous working code. Then any such optimization that comes up in the future doesn't need yet another keyword argument. I wonder if there isn't already a way to do this without monkey patching.

@cpcloud As of today there is no way to achieve this without monkey patching.
I use a monkey patched version to use COPY ... FROM.

I could work on patch to make this more flexible. I suggest adding a parameter insert_method to to_sql(). Apart from the default (previous) implementation, we could add multi-value (the 0.23 version) and COPY FROM.

The way without monkeypatching is subclassing the table class and overriding the appropriate method, but that is not necessarily cleaner than monkeypatching, as it is private methods you are overriding anyway.

I like the proposed idea of allowing more flexibility. But we should think a bit about the exact interface we want (what does a user need to create the appropriate construct? The sqlalchemy table, the data (in which form?), the connection, ...).

@jorisvandenbossche

The way without monkeypatching is subclassing the table class and overriding the appropriate method, but that is not necessarily cleaner than monkeypatching, as it is private methods you are overriding anyway.

to_sql() has no direct control of which table class will be used. So currently, subclassing is not enough, you MUST monkeypatch.

Also you would need to be careful with subclassing the table class as there are 2 implementations: SQLAlchemy and SQLite.

Another problem of monkeypatching/subclassing (as opposed to a parameter) is that the same method has to be used globally by an application and cannot be fine-tuned for each call of to_sql().

But we should think a bit about the exact interface we want (what does a user need to create the appropriate construct? The sqlalchemy table, the data (in which form?), the connection, ...).

I think just follow the API already used for _execute_insert(self, conn, keys, data_iter):

  • a reference to the table object
  • a connection
  • a list of keys/columns
  • a data_iter with the values

That was enough for me to implement PostgreSQL COPY without having to repeat pandas-specific processing of columns and conversion of values.
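For illustration, a sketch of a COPY-based insert function written against the interface described above (table, conn, keys, data_iter). It assumes a PostgreSQL connection via psycopg2 and is not the exact patch used here:

import csv
from io import StringIO

def psql_copy_insert(table, conn, keys, data_iter):
    # serialise the chunk to CSV in memory
    buf = StringIO()
    csv.writer(buf).writerows(data_iter)
    buf.seek(0)

    columns = ', '.join('"{}"'.format(k) for k in keys)
    if table.schema:
        table_name = '{}.{}'.format(table.schema, table.name)
    else:
        table_name = table.name

    sql = 'COPY {} ({}) FROM STDIN WITH CSV'.format(table_name, columns)
    # conn is a SQLAlchemy connection; .connection exposes the raw DBAPI one
    with conn.connection.cursor() as cur:
        cur.copy_expert(sql=sql, file=buf)

A hook receiving these four arguments can then be swapped per call instead of being applied globally via monkeypatching.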

to_sql() has no direct control of which table class will be used. So currently, subclassing is not enough, you MUST monkeypatch.

Yeah, for to_sql that is true, but in principle you can subclass and then use the methods of that class. With emphasis on "can", as I also don't really recommend it as I said in my previous comment.

Anyway, let's leave that discussion as I think we agree that we need some way to customize it (at least to switch between the single / multi values to solve the regression).

I think just follow the API already used for _execute_insert(self, conn, keys, data_iter)

Would there be occasions where a user would want the actual dataframe instead of data_iter?

@cpcloud what do you think of this interface?

@schettino72 One option (for the PR) would also be to implement this parameter, but leave the postgres-copy one as an example of how to use this (to not add an option that only works for a single engine for now; although given its superior speed, we should really think how to deal with this).

Hi all! My apologies for the regression introduced. I like the solution in https://github.com/pandas-dev/pandas/pull/21199 much better.

For context, the motivation for needing a multivalues insert was to interact with a Presto database. In that case a normal insert occurs row by row, each incurring a network call, making it prohibitively slow. This may be the case for other analytic databases as well (Redshift, BigQuery).

But definitely we should not have this be the default so thank you @schettino72 for cleaning this up and again apologies.

Just verified that multivalues has this behavior for Redshift as well.

In [60]: df = pd.DataFrame({'foo': np.arange(100)})

In [61]: conn.dialect
Out[61]: <sqlalchemy.dialects.postgresql.psycopg2.PGDialect_psycopg2 at 0x11d8e9748>

In [62]: conn.dialect.supports_multivalues_insert = False

In [63]: %time df.to_sql('test_no_multivalues_insert', con=conn, schema='default', index=False)
CPU times: user 21.4 ms, sys: 11.6 ms, total: 32.9 ms
Wall time: 38.7 s

In [64]: conn.dialect.supports_multivalues_insert = True

In [65]: %time df.to_sql('test_multivalues_insert', con=conn, schema='default', index=False)
CPU times: user 7.88 ms, sys: 2.38 ms, total: 10.3 ms
Wall time: 4.26 s

@schettino72 do you have time to work on the PR?
Other short-term option for 0.23.1 is to revert the original change, and target the PR for 0.24.0

@schettino72 do you have time to work on the PR?

Sure. I was waiting for more feedback on API. From my side I guess just need to update docstrings and docs...

Should I target 0.24 or 0.23.1?

I think 0.23.1

Yes, we certainly want to do something for 0.23.1 (at a minimum revert). But it would be nice to also get the option in at the same time.
I am only not fully sure about the function interface (in case the exact API of the passed function needs more discussion). Maybe we could limit 0.23.1 to adding method='default'/'multi', and leave the ability to pass a function for 0.24.0.
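For reference, a sketch of what the opt-in ends up looking like with the method keyword used in later comments of this thread, combined with a chunksize bounded by SQLite's default 999-variable limit:

import numpy as np
import pandas as pd
from sqlalchemy import create_engine

df = pd.DataFrame(np.random.random((20000, 10)))
engine = create_engine('sqlite://')

# each row of a multi-values INSERT binds (n_columns + 1) parameters,
# the +1 being the written index; stay under SQLite's default limit of 999
chunksize = 999 // (len(df.columns) + 1)
df.to_sql('test', engine, if_exists='replace',
          method='multi', chunksize=chunksize)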

Had the initial issue using 0.23.0; updating to 0.23.4 corrected the problem. Thanks for the fix.

On 1.0.1 and still facing the same issue

@theSuiGenerisAakash try removing "method='multi'". This might help.

Or reduce the chunksize to see whether it will work, e.g. chunksize=100.

@theSuiGenerisAakash try removing "method='multi'". This might help.

This defeats the purpose

This happens on pandas 1.0.1; this issue should be reopened!
The session is sqlite, without method='multi':

df_current.to_sql(data_schema.__tablename__, db_engine, index=False, if_exists='append')

OS: Windows 10. It seems some people don't have this problem on Linux and Mac.
