Hi,
While using the Python google-cloud-bigtable client to read a large collection of data (10M to 90M rows of at least 4MB to 5MB per row), I noticed a strange behavior: the RAM used was at least twice the size of the data I was reading from the table.
Here is the code I wrote to reproduce the issue:
```python
from google.cloud import bigtable

instance_name = "the-instance"
table_name = "the-table"

client = bigtable.Client(admin=False)
instance = client.instance(instance_name)
table = instance.table(table_name)

last_readall_key = None
rows = table.read_rows(start_key=last_readall_key)
while True:
    try:
        rows.consume_next()
    except StopIteration:
        break
    except Exception as e:
        # Mostly grpc._channel._Rendezvous exceptions are raised here.
        print 'Got exception of type %s : %s' % (type(e), str(e))
        table = instance.table(table_name)
        rows = table.read_rows(start_key=last_readall_key)
        rows.consume_next()
    for key in rows.rows:
        last_readall_key = key
    rows._rows.clear()  # This was an attempt to fix the leak, but it has no effect.
```
When running the above code, the machine's RAM keeps filling up until it is exhausted.
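To quantify the growth, one can log the process's peak resident set size inside the loop. Here is a minimal sketch using only the standard library (the `rss_mb` helper name is mine; note that `ru_maxrss` is reported in kilobytes on Linux but in bytes on macOS):

```python
import resource

def rss_mb():
    # Peak resident set size of the current process.
    # ru_maxrss is in kilobytes on Linux, in bytes on macOS.
    usage = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    return usage / 1024.0  # MB on Linux

# Inside the read loop, e.g. every few thousand rows:
#   print 'RSS: %.1f MB' % rss_mb()
```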
Another odd thing is that the problem does not occur on my macOS Sierra 10.12.6 machine; I only see it on Linux.
I observed the problem on Ubuntu 14.04 LTS and Ubuntu 10.04.5 LTS. I have not tried other Linux distributions or other Ubuntu versions, though.
Am I doing something wrong? Does anyone know what may cause this problem?
Would anyone have a trick to avoid it? One idea I am considering is sketched below.
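A rough, untested sketch: read the table in bounded chunks via the `limit` argument of `read_rows`, so that at most one chunk of rows is retained in memory at a time. The `CHUNK` value, the `process()` handler, and the `b'\x00'`-suffix trick for advancing the start key are my own assumptions, not part of the library API:

```python
from google.cloud import bigtable

CHUNK = 10000  # hypothetical chunk size; tune to the row size

client = bigtable.Client(admin=False)
instance = client.instance("the-instance")
table = instance.table("the-table")

start_key = None
while True:
    rows = table.read_rows(start_key=start_key, limit=CHUNK)
    rows.consume_all()
    if not rows.rows:
        break
    for key in sorted(rows.rows):
        process(rows.rows[key])  # hypothetical per-row handler
    # Resume just past the last key read: appending b'\x00' yields the
    # smallest key strictly greater than it (start_key is inclusive).
    start_key = max(rows.rows) + b'\x00'
    del rows  # drop the whole chunk so it can be garbage-collected
```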
Below are the pip requirements to reproduce:
```
cachetools==2.0.1
certifi==2017.11.5
chardet==3.0.4
dill==0.2.7.1
enum34==1.1.6
future==0.16.0
futures==3.2.0
google-api-core==0.1.2
google-auth==1.2.1
google-cloud-bigtable==0.28.1
google-cloud-core==0.28.0
google-gax==0.15.16
googleapis-common-protos==1.5.3
grpcio==1.7.3
idna==2.6
ply==3.8
protobuf==3.5.0.post1
pyasn1==0.4.2
pyasn1-modules==0.2.1
pytz==2017.3
requests==2.18.4
rsa==3.4.2
six==1.11.0
urllib3==1.22
```
(these dependencies are installed automatically when running `pip install google-cloud-bigtable`)
Thanks for your time.
Hi, does anyone have an idea of how to solve this?
@chemelnucfin this is much more of a bug than a question.
@Rafff Could you give an idea of how much data is in the table?
FYI, we're working on a yield_row implementation (https://github.com/zakons/google-cloud-python/tree/feature/row_iterator) that gets rid of the use of a map to track returned rows. In other languages, we use iterators that return rows rather than using an intermediary map. That type of iterator seems more like how a user thinks about interacting with a stream of results from Cloud Bigtable.
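To illustrate the direction, here is a rough sketch of my own (not the branch's actual code; the `yield_rows` name is hypothetical, and it reaches into the private `_rows` map just as the snippet above does):

```python
def yield_rows(table, start_key=None):
    """Yield each completed row instead of accumulating them in a map."""
    rows = table.read_rows(start_key=start_key)
    while True:
        try:
            rows.consume_next()
        except StopIteration:
            return
        # Hand every finished row to the caller, then drop it from the
        # internal map so it can be garbage-collected.
        for key in list(rows._rows):
            yield rows._rows.pop(key)

# The caller then sees a plain iterator:
#
#   for row in yield_rows(table):
#       handle(row)  # hypothetical per-row handler
```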
@dhermes the table is huge: it contains almost 90M rows, where each row holds one cell of 4,000 bytes (roughly 360 GB in total).
@sduskis this is awesome! That's exactly what I was trying to achieve: I was using read_rows to yield rows one by one to my application. When do you think this could be released?
Thanks for your time!
@Rafff Please follow along in #4679. I will close this issue now. If you have any problems or concerns, feel free to raise them here, in #4679, or in a new issue. Thank you for your patience.
@chemelnucfin Let's leave this issue open until the fix (#4679) is merged.