Hi,
While using the Python google-cloud-bigtable client to read a large collection of data (10M to 90M rows of at least 4MB to 5MB per row), I noticed a strange behavior: the RAM used was at least twice the size of the data I was reading from the table.
Here is the code I wrote to reproduce the issue:
```python
from google.cloud import bigtable

instance_name = "the-instance"
table_name = "the-table"

client = bigtable.Client(admin=False)
instance = client.instance(instance_name)
table = instance.table(table_name)

last_readall_key = None
rows = table.read_rows(start_key=last_readall_key)
while True:
    try:
        rows.consume_next()
    except StopIteration:
        break
    except Exception as e:
        # Mostly grpc._channel._Rendezvous exceptions are raised here.
        print 'Got exception of type %s : %s' % (type(e), str(e))
        table = instance.table(table_name)
        rows = table.read_rows(start_key=last_readall_key)
        rows.consume_next()
    for key in rows.rows:
        last_readall_key = key
    rows._rows.clear()  # This was an attempt to fix the leak, but it has no effect.
```
When running the above code, the machine's RAM keeps filling up until it is exhausted.
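To quantify the growth, one can log the process's peak resident set size inside the loop. Here is a minimal sketch using only the standard library (the `rss_mb` helper name is mine; note that `ru_maxrss` is reported in kilobytes on Linux but in bytes on macOS):

```python
import resource

def rss_mb():
    # Peak resident set size of the current process.
    # ru_maxrss is in kilobytes on Linux, in bytes on macOS.
    usage = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    return usage / 1024.0  # MB on Linux

# Inside the read loop, e.g. every few thousand rows:
#   print 'RSS: %.1f MB' % rss_mb()
```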
Another odd thing is that the problem does not occur on my macOS Sierra 10.12.6 machine; I only see it on Linux.
I observed the problem on Ubuntu 14.04 LTS and Ubuntu 10.04.5 LTS. I have not tried other Linux distributions or other Ubuntu versions, though.
Am I doing something wrong? Does anyone know what may cause this problem?
Would anyone have a trick to avoid it? One idea I am considering is sketched below.
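A rough, untested sketch: read the table in bounded chunks via the `limit` argument of `read_rows`, so that at most one chunk of rows is retained in memory at a time. The `CHUNK` value, the `process()` handler, and the `b'\x00'`-suffix trick for advancing the start key are my own assumptions, not part of the library API:

```python
from google.cloud import bigtable

CHUNK = 10000  # hypothetical chunk size; tune to the row size

client = bigtable.Client(admin=False)
instance = client.instance("the-instance")
table = instance.table("the-table")

start_key = None
while True:
    rows = table.read_rows(start_key=start_key, limit=CHUNK)
    rows.consume_all()
    if not rows.rows:
        break
    for key in sorted(rows.rows):
        process(rows.rows[key])  # hypothetical per-row handler
    # Resume just past the last key read: appending b'\x00' yields the
    # smallest key strictly greater than it (start_key is inclusive).
    start_key = max(rows.rows) + b'\x00'
    del rows  # drop the whole chunk so it can be garbage-collected
```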
Below are the pip requirements to reproduce:
```
cachetools==2.0.1
certifi==2017.11.5
chardet==3.0.4
dill==0.2.7.1
enum34==1.1.6
future==0.16.0
futures==3.2.0
google-api-core==0.1.2
google-auth==1.2.1
google-cloud-bigtable==0.28.1
google-cloud-core==0.28.0
google-gax==0.15.16
googleapis-common-protos==1.5.3
grpcio==1.7.3
idna==2.6
ply==3.8
protobuf==3.5.0.post1
pyasn1==0.4.2
pyasn1-modules==0.2.1
pytz==2017.3
requests==2.18.4
rsa==3.4.2
six==1.11.0
urllib3==1.22
```
(these dependencies are installed automatically when running `pip install google-cloud-bigtable`)
Thanks for your time.
Hi, does anyone have an idea of how to solve this?
@chemelnucfin this is much more of a bug than a question.
@Rafff Could you give an idea of how much data is in the table?
FYI, we're working on a yield_row implementation (https://github.com/zakons/google-cloud-python/tree/feature/row_iterator) that gets rid of the use of a map to track returned rows. In other languages, we use iterators that return rows rather than using an intermediary map. That type of iterator seems more like how a user thinks about interacting with a stream of results from Cloud Bigtable.
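To illustrate the direction, here is a rough sketch of my own (not the branch's actual code; the `yield_rows` name is hypothetical, and it reaches into the private `_rows` map just as the snippet above does):

```python
def yield_rows(table, start_key=None):
    """Yield each completed row instead of accumulating them in a map."""
    rows = table.read_rows(start_key=start_key)
    while True:
        try:
            rows.consume_next()
        except StopIteration:
            return
        # Hand every finished row to the caller, then drop it from the
        # internal map so it can be garbage-collected.
        for key in list(rows._rows):
            yield rows._rows.pop(key)

# The caller then sees a plain iterator:
#
#   for row in yield_rows(table):
#       handle(row)  # hypothetical per-row handler
```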
@dhermes the table is huge: it contains almost 90M rows, where each row holds one cell of 4,000 bytes (roughly 360 GB in total).
@sduskis this is awesome! That's exactly what I was trying to achieve: I was using read_rows to yield rows one by one to my application. When do you think this could be released?
Thanks for your time!
@Rafff Please follow along in #4679. I will close this issue now. If you have any problems or concerns, feel free to raise them here, in #4679, or in a new issue. Thank you for your patience.
@chemelnucfin Let's leave this issue open until the fix (#4679) is merged.