Google-cloud-python: Question: How to optimize load_data-Operation

Created on 23 Jan 2017 · 11 comments · Source: googleapis/google-cloud-python

I want to copy my MySQL data (>200 million rows) to BigQuery, so I wrote a Python script that uses this library. At the moment it streams 1,000 rows per request at about 1.1 requests/second. That is not very fast, and it would take days to transfer the whole dataset. I am sure this can be optimized, but I don't know how. Do you have any suggestions? You can find my source code here

I thought about the following points:

  • Each request contains 1,000 rows; should I choose a bigger number?
  • Does this library use gzip by default?
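For the streaming path, raising the batch size helps up to the documented per-request limits of the tabledata.insertAll API (on the order of 10,000 rows and 10 MB per request). A minimal batching sketch follows; the chunking logic is generic Python, while the actual insert call is left as a comment because the exact client method depends on the library version:

```python
def chunked(rows, size=5000):
    """Yield lists of at most `size` rows from any iterable."""
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch

# Hypothetical usage -- `table` stands for a BigQuery table handle from
# this library, and fetch_mysql_rows() for the MySQL cursor iterator:
# for batch in chunked(fetch_mysql_rows(), size=5000):
#     table.insert_data(batch)
```

Even with bigger batches, streaming inserts are rate- and quota-limited, which is why the answers below recommend bulk load jobs instead.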


All 11 comments

@tswast @jonparrott Any suggestions? Is it possible to write the raw data to GCS and then point BigQuery at a blob or bucket?

Yes, I would recommend dumping to CSV and loading to BigQuery rather than reading from MySQL and writing to BigQuery in a script.

Use mysqldump with the --tab=dir_name, --fields-terminated-by=',', and --fields-enclosed-by='"' options to create a CSV file for each table, and write the CSV files to Cloud Storage.
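A sketch of those two steps (database, directory, and bucket names are placeholders):

```shell
# Dump every table of `mydb` as comma-delimited, double-quoted CSV into
# /tmp/dump (one .txt data file plus one .sql schema file per table).
mysqldump --tab=/tmp/dump \
          --fields-terminated-by=',' \
          --fields-enclosed-by='"' \
          -u root -p mydb

# Upload the data files to Cloud Storage; -m parallelizes the transfer.
gsutil -m cp /tmp/dump/*.txt gs://my-bucket/mysql-export/
```

One caveat: with --tab, the data files are written by the MySQL server process itself (via SELECT INTO OUTFILE), so the directory must exist on the server host and be writable by mysqld.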

Create a load job from Cloud Storage using load_table_from_storage().
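Under the hood a load job is driven by the JobConfigurationLoad section of the jobs.insert REST API; the fields that matter for a mysqldump-style CSV are fieldDelimiter, quote, and nullMarker. Since the Python surface has changed across library versions, this sketch just builds the REST-level configuration dict (URIs and table names are placeholders):

```python
def csv_load_config(source_uri, dataset_id, table_id):
    """Build a JobConfigurationLoad dict for a mysqldump-produced CSV
    (comma-delimited, double-quoted, NULL written as \\N)."""
    return {
        "load": {
            "sourceUris": [source_uri],
            "sourceFormat": "CSV",
            "fieldDelimiter": ",",
            "quote": '"',
            "nullMarker": "\\N",  # mysqldump writes NULL as \N
            "destinationTable": {
                "datasetId": dataset_id,
                "tableId": table_id,
            },
        }
    }
```

Whatever client method you call, it ultimately submits a configuration like this one job per table (or per set of wildcard-matched files).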

Also, for uploading a massive CSV to GCS, I recommend the gcloud / gsutil CLI; it is VERY optimized for speed.

@tswast Hi Tim, thanks for your answer! Using mysql/gsutil was my first attempt to migrate the data, but I had some problems with NULL values. I'll try the same with mysqldump; maybe it uses a different format to store NULL values.

@dhermes Thanks for your support! :)

That's good feedback. I'll check with the rest of the BigQuery team to see what format is expected when loading NULL values.

@inkrement I'm closing this out, since there isn't much more to go on from here.

@dhermes Sure. It seems like mysqldump behaves differently from mysql, so this helped me a lot!

@tswast Seems like mysqldump uses \N as the null value, which is great. I am still exporting at the moment, so I don't know yet whether quotes are escaped correctly. It would be great if you could add a few sentences about this to the BigQuery documentation :)

@tswast Seems like the NULL problem is solved now, but quotes inside quoted fields do not work:

MySQL allows them and escapes them with \, but BigQuery's load jobs do not allow specifying an escape character and expect a quote inside a string to be escaped by doubling it to "" (as suggested in RFC 4180).
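One possible work-around (my own sketch, not an official tool) is to rewrite the MySQL-style backslash escaping into the RFC 4180 style before loading, doubling quotes inside quoted fields while leaving other escape sequences such as \N untouched:

```python
def mysql_to_rfc4180(field):
    """Convert a MySQL-escaped CSV field body to RFC 4180 escaping.

    MySQL escapes a double quote as \\" (and a backslash as \\\\);
    RFC 4180 instead doubles the quote: "".
    """
    out = []
    i = 0
    while i < len(field):
        ch = field[i]
        if ch == "\\" and i + 1 < len(field):
            nxt = field[i + 1]
            if nxt == '"':
                out.append('""')        # RFC 4180: double the quote
            elif nxt == "\\":
                out.append("\\")        # unescape backslash-backslash
            else:
                out.append(ch + nxt)    # leave e.g. \N alone
            i += 2
        else:
            out.append(ch)
            i += 1
    return "".join(out)
```

This operates on a single field body; a real migration would have to apply it while parsing the dump files, before re-uploading them to Cloud Storage.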

Too many errors encountered. (error code: invalid)
/gzip/subrange//bigstore/imsm-00.csv.gz: Error detected while parsing row starting at position: 21470. Error: Data between close double quote (") and field separator. (error code: invalid)

Any suggestions?

@inkrement Please file an issue on the BigQuery public tracker. https://code.google.com/p/google-bigquery/

The public issue for BigQuery supporting the MySQL CSV formats also includes some work-arounds.

@tswast haha.. thanks, but that is my issue. As I mentioned in this thread, it is not a nice solution: I would have to scan all my data to find a symbol that does not occur.. but what's next? If I add a new row to my database the next day that contains exactly this symbol, I have to find another one. MySQL is a very popular database, and its format is used by a lot of other databases (MemSQL, MariaDB, ...). It would be great if Google provided some tools to migrate data. I would like to spend my money on queries and Google storage, not on "services" like alooma.
