I am working with large files of JSON lines, each of the form:
jsn = [{'a':1.23,'b':324.3242,'c':-.2343242},{'a':21.23,'b':3.3242,'c':-3.2343242}] # example of one line
The goal is to concatenate the dict values into a NumPy array.
The error thrown by @njit is: - argument 2: Cannot type list element of <class 'dict'>
I understand the issue and have been looking for a workaround. Since #4848 is one of the few posts on a similar issue (but without code), I am posting a minimal example here:
import numpy as np
from numba import njit

@njit  # works nicely w/o njit
def store_func(data_store, rows_lim, jsn):
    for ii in range(len(jsn)):
        data_store = np.concatenate((data_store[-rows_lim:, :], np.array([[jsn[ii]['a'], jsn[ii]['b'], jsn[ii]['c']]])), axis=0)
    return data_store

def on_message(data_store, rows_lim):
    jsn = [{'a':1.23,'b':324.3242,'c':-.2343242},{'a':21.23,'b':3.3242,'c':-3.2343242}]  # the json line
    data_store = store_func(data_store, rows_lim, jsn)
    return data_store

data_store = np.ones([0, 3], float)  # init
rows_lim = 1000
%timeit on_message(data_store, rows_lim)  # jupyter notebook
Now imagine millions of on_message calls, each with hundreds of jsn elements (len(jsn) >> 100). Is there any reasonable workaround? Thank you, and thank you for the awesome library.
Thanks for the report. Numba's discussion forum, https://numba.discourse.group/c/numba/, is a great place to ask this sort of thing: there is a "What is this error message?" category along with a place for asking for more general help. As a brief answer: if the goal is to speed up creation of a NumPy array from JSON records, Numba is not going to be able to do much with your code (see what Numba is good at in the 5 minute guide to Numba), and the operations are likely memory bound anyway. With this in mind, something like this should help:
import numpy as np

DATA_SIZE = 10000

def store_func(data_store, rows_lim, jsn):
    for ii in range(len(jsn)):
        data_store = np.concatenate((data_store[-rows_lim:, :], np.array([[jsn[ii]['a'], jsn[ii]['b'], jsn[ii]['c']]])), axis=0)
    return data_store

jsn = [{'a':1.23,'b':324.3242,'c':-.2343242}] * DATA_SIZE
data_store = np.ones([0, 3], np.float64)
rows_lim = 1000
gold = store_func(data_store, rows_lim, jsn)
%timeit store_func(data_store, rows_lim, jsn)
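
def quicker_store_func(rows_lim, jsn):
    ljsn = len(jsn)
    # not sure what rows_lim is for?
    # store_func ends up with at most rows_lim + 1 rows, so match that
    # (min also avoids an IndexError when rows_lim == len(jsn))
    n = min(ljsn, rows_lim + 1)
    # allocate the output once, up front
    data_store = np.empty((n, 3))
    # copy the last n records, which are the rows store_func keeps
    offset = ljsn - n
    for ii in range(n):
        rec = jsn[offset + ii]
        data_store[ii] = rec['a'], rec['b'], rec['c']
    return data_store

check = quicker_store_func(rows_lim, jsn)
np.testing.assert_allclose(gold, check)
%timeit quicker_store_func(rows_lim, jsn)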
which gives a 64x improvement:
138 ms ± 160 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
2.15 ms ± 9.57 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
This improvement comes largely from two things:
1. Allocating the output array once, up front (np.empty), instead of growing it inside the loop.
2. Avoiding temporaries: np.array, temporary lists from doing [[jsn[ii]['a']... etc]], and then more temporary arrays from doing a concatenate in a loop where the output array is thrown away and reassigned in each iteration, all add up.
Hope this helps.
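For the many-messages case in the question, the same idea can be carried across calls. What follows is a minimal sketch, assuming every record has exactly the keys a, b and c and that rows_lim >= 1; append_batch and KEYS are hypothetical names, not from this thread. It builds each message's rows in one go and concatenates once per message rather than once per record, and is intended to keep the same rows as store_func.

```python
import numpy as np

KEYS = ("a", "b", "c")  # assumption: every record carries exactly these keys


def append_batch(data_store, rows_lim, jsn):
    """Append one message's records, keeping at most rows_lim + 1 rows."""
    if not jsn:
        # store_func's loop body never runs for an empty message
        return data_store
    cap = rows_lim + 1
    # Only the newest `cap` records of the message can survive the trim.
    new_rows = np.array(
        [[rec[k] for k in KEYS] for rec in jsn[-cap:]], dtype=np.float64
    )
    # Number of old rows that still fit next to the new ones.
    keep_old = max(cap - len(new_rows), 0)
    start = max(len(data_store) - keep_old, 0)
    # One concatenate per message instead of one per record.
    return np.concatenate((data_store[start:], new_rows), axis=0)


store = np.empty((0, 3), dtype=np.float64)
message = [{"a": 1.23, "b": 324.3242, "c": -0.2343242},
           {"a": 21.23, "b": 3.3242, "c": -3.2343242}]
store = append_batch(store, 1000, message)
```

Whether this is fast enough for millions of messages would need measuring with %timeit on real data; a preallocated ring buffer is another option if the per-message concatenate still dominates.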
As for the error message itself: Numba doesn't support Python dictionaries in nopython mode, so they have to be converted to numba.typed.Dict instances first: https://numba.readthedocs.io/en/stable/reference/pysupported.html#typed-dict
Thank you, I appreciate the answer.
Good point (2) regarding the temporaries. Keeping it simple is always a good approach, or at least the first thing to try :)