Hello all,
The documentation [1] makes it clear that httplib2 objects aren't thread safe in the Python client library. Are clients that have gRPC support (such as Pub/Sub) thread safe when using gRPC? This question has been asked about the Java client library before [2], but I'd appreciate a firm answer for Python too.
Thank you.
[1] https://developers.google.com/api-client-library/python/guide/thread_safety
[2] https://github.com/GoogleCloudPlatform/google-cloud-java/issues/1320
Hi @AdamLazarus,
Thanks for asking.
The short answer is: We _think_ so. :-)
(Additionally, if you find thread-safety issues, feel free to open them as bugs.)
Just for information, it seems that you cannot share a datastore.Client() object across threads. You'll get errors that look like this:
E1130 10:54:55.377618000 140736526345152 ssl_transport_security.c:435] Corruption detected.
E1130 10:54:55.377821000 140736526345152 ssl_transport_security.c:411] error:100003fc:SSL routines:OPENSSL_internal:SSLV3_ALERT_BAD_RECORD_MAC
E1130 10:54:55.377891000 140736526345152 secure_endpoint.c:185] Decryption error: TSI_DATA_CORRUPTED
@philipperemy I'd love to see an example that reproduces this. I've used Client()-s based on gRPC connections across multiple threads without issue.
Sure! This is roughly the code where I have one datastore.Client() per thread:
from google.cloud import datastore


def get_data(symbol_):
    print('Init...')
    data_store_client = datastore.Client()
    print('Done...')
    query = data_store_client.query(kind=symbol_)
    query_iter = query.fetch()
    print_once = True
    for entity in query_iter:
        print(entity)


def parallel_function(f, sequence, num_threads=None):
    from multiprocessing import Pool
    pool = Pool(processes=num_threads)
    result = pool.map(f, sequence)
    cleaned = [x for x in result if x is not None]
    pool.close()
    pool.join()
    return cleaned


def run_query():
    [...]
    parallel_function(f=get_data, sequence=symbols, num_threads=4)
The other code is very similar except that I define a global variable DATA_STORE_CLIENT and this variable is visible across all the threads.
Neither version works.
When num_threads=1 it runs smoothly.
Has this ever been addressed? Creating a new client for each thread can effectively double the number of threads in the system.
@speedplane if you want something that can run in production, you might want to use something else. Those libs are not very stable unfortunately.
@philipperemy what other options are there for accessing the datastore? Isn't this the official library?
I'm looking at the code now, and it's much worse than 1 new thread per client. It seems that when using gRPC, there are 4 threads: a consumption thread, a channel spin thread, a delivering thread, and a polling thread. (I'm not sure what these threads do or whether they're always used.) This seems to be per client, and it can get bad. Take the following example:
That results in 720 threads (= 4 * 20 * (1 + 2 * 4)) when 80 would have worked fine.
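One way to check a per-client thread count like this is to compare the number of live threads before and after creating a client. The sketch below is only an illustration: `threads_started_by` is a hypothetical helper name, and `threading.active_count()` only sees threads started through Python's `threading` module, so threads created inside a C extension would not be counted. Threads that a client starts lazily (for example on its first request) would also be missed.

```python
import threading


def threads_started_by(factory):
    """Call factory() and report how many extra Python threads are alive afterwards.

    `factory` is any zero-argument callable, e.g. datastore.Client (an assumption;
    it needs credentials, and its threads may only start on first use).
    """
    before = threading.active_count()
    obj = factory()
    after = threading.active_count()
    return obj, after - before
```

Calling `threads_started_by(datastore.Client)` and then issuing one query before measuring again would give a rough per-client figure to plug into the estimate above.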
@speedplane We expect gRPC-based clients to be thread safe: the issues we know of have to do with multiprocessing (forking after creating a client).
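To make the distinction concrete, here is a minimal sketch of sharing one client across threads, with no fork involved. It assumes the shared client really is thread safe, as expected above. `fetch_kind`, `fetch_all`, and the `StandIn*` classes are hypothetical names used only so the sketch runs without credentials; in real code you would pass `datastore.Client` as `make_client`.

```python
from concurrent.futures import ThreadPoolExecutor


class StandInQuery:
    """Hypothetical stand-in for the object returned by client.query()."""

    def __init__(self, kind):
        self.kind = kind

    def fetch(self):
        # A real query would stream entities from Datastore.
        return iter([f"{self.kind}-entity"])


class StandInClient:
    """Hypothetical stand-in for datastore.Client so the sketch runs offline."""

    def query(self, kind):
        return StandInQuery(kind)


def fetch_kind(client, kind):
    # Same call pattern as the reproduction: client.query(kind=...).fetch()
    return list(client.query(kind=kind).fetch())


def fetch_all(kinds, make_client=StandInClient, num_threads=4):
    # Create ONE client, then share it across worker threads.
    # ThreadPoolExecutor starts threads in this process; nothing is forked.
    client = make_client()
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        return list(pool.map(lambda kind: fetch_kind(client, kind), kinds))
```

Note that `multiprocessing.Pool`, used in the reproduction earlier in this thread, starts worker processes rather than threads, so it exercises a different code path from the one sketched here.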