I want to copy my MySQL data (more than 200 million rows) to BigQuery, so I created a Python script that uses this library. At the moment it streams 1,000 rows per request and achieves about 1.1 requests/second. That is not very fast, and it would take days to transfer the whole dataset. I am sure this can be optimized, but I don't know how. Do you have any suggestions? You can find my source code here
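For a rough sense of scale (using the approximate figures quoted above, about 1,000 rows per request at ~1.1 requests/second; these are not measurements):

```python
# Back-of-the-envelope runtime estimate for the streaming approach.
rows_total = 200_000_000
rows_per_second = 1000 * 1.1           # ~1,100 rows/second
hours = rows_total / rows_per_second / 3600
print(round(hours, 1))                 # ~50.5 hours, i.e. more than two days
```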
I thought about the following points:
@tswast @jonparrott Any suggestions? Is it possible to write the raw data to GCS and then point BigQuery at a blob or bucket?
Yes, I would recommend dumping to CSV and loading to BigQuery rather than reading from MySQL and writing to BigQuery in a script.
Use mysqldump to create a CSV file for each table with the `--tab=dir_name`, `--fields-terminated-by=,`, and `--fields-enclosed-by='"'` options, and write the CSV files to Cloud Storage.
Create a load job from Cloud Storage using `load_table_from_storage()` (a sketch of this step follows below).
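A minimal sketch of that load step, assuming the files were exported with mysqldump and uploaded to a hypothetical bucket `gs://my-bucket`, and assuming the older google-cloud-bigquery client where load jobs from GCS are started with `load_table_from_storage()` (newer releases call this `load_table_from_uri()` and take a `LoadJobConfig`); exact property names may differ between versions:

```python
from google.cloud import bigquery

# The CSV files are assumed to have been produced by something like
#   mysqldump --tab=/tmp/dump --fields-terminated-by=, --fields-enclosed-by='"' mydb
# and then uploaded to gs://my-bucket/dump/.  The bucket, dataset, table and
# file names below are placeholders.
client = bigquery.Client()
dataset = client.dataset('my_dataset')
table = dataset.table('my_table')      # destination table; schema must already be defined

job = client.load_table_from_storage(
    'load-mysql-dump-1',               # a unique job id
    table,
    'gs://my-bucket/dump/my_table.csv',
)
job.source_format = 'CSV'
job.field_delimiter = ','
job.quote_character = '"'
job.begin()                            # starts the job; poll its state until it is DONE
```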
Also, for uploading a massive CSV to GCS, I recommend using the gcloud / gsutil CLI; it is very well optimized for speed.
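For reference, the upload step might look like this when scripted from Python (the local directory and bucket name are hypothetical); running `gsutil -m cp` directly from a shell works just as well:

```python
import subprocess

# -m parallelizes the copy across threads/processes, which is what makes gsutil
# fast for large uploads; dump/ is the directory written by mysqldump --tab.
subprocess.check_call(
    ['gsutil', '-m', 'cp', '-r', 'dump/', 'gs://my-bucket/'])
```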
@tswast Hi Tim, thanks for your answer! Using the mysql client together with gsutil was my first attempt to migrate the data, but I had some problems with NULL values. I'll try the same with mysqldump; maybe it uses a different format for NULL values.
@dhermes Thanks for your support! :)
That's good feedback. I'll check with the rest of the BigQuery team to see what format is expected for loading NULL values.
@inkrement I'm closing this out since there isn't much more to go forward with from here, OK?
@dhermes Sure; it seems mysqldump behaves differently from the mysql client, so this helped me a lot!
@tswast It seems mysqldump uses \N as the NULL value, which is great. I am still exporting at the moment, so I don't know yet whether quotes are escaped correctly. It would be great if you could add a few sentences about this to the BigQuery documentation :)
@tswast It seems the NULL problem is solved now, but quotes inside quoted fields do not work:
MySQL allows them and escapes them with \, but BigQuery's load job does not allow specifying an escape character and instead expects quotes inside strings to be escaped by doubling the quote character (as suggested by RFC 4180).
Too many errors encountered. (error code: invalid)
/gzip/subrange//bigstore/imsm-00.csv.gz: Error detected while parsing row starting at position: 21470. Error: Data between close double quote (") and field separator. (error code: invalid)
Any suggestions?
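One conceivable work-around, not taken from this thread, is to rewrite the MySQL-style escapes into RFC 4180 doubling before uploading. A minimal sketch, assuming the only escape sequences in the dump are \\ and \" (and \N for NULL, which is left untouched) and that no field contains an embedded newline:

```python
def mysql_to_rfc4180(line):
    r"""Rewrite one mysqldump-escaped row so BigQuery's CSV parser accepts it:
    \" becomes "" and \\ becomes a single backslash; \N (NULL) is left alone."""
    out = []
    i = 0
    while i < len(line):
        if line[i] == '\\' and i + 1 < len(line):
            nxt = line[i + 1]
            if nxt == '"':
                out.append('""')   # RFC 4180: escape a quote by doubling it
                i += 2
                continue
            if nxt == '\\':
                out.append('\\')   # collapse an escaped backslash
                i += 2
                continue
        out.append(line[i])
        i += 1
    return ''.join(out)

# Example: a quoted field containing embedded quote characters.
assert mysql_to_rfc4180('1,"say \\"hi\\""') == '1,"say ""hi"""'
```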
@inkrement Please file an issue on the BigQuery public tracker. https://code.google.com/p/google-bigquery/
The public issue about BigQuery support for MySQL's CSV format also includes some work-arounds.
@tswast Haha, thanks, but that is my issue. As I mentioned in that thread, this is not a nice solution: I would have to scan all my data to find a symbol that does not occur anywhere. And what comes next? If I add a new row to my database the next day that contains exactly this symbol, I have to find another one. MySQL is a very popular database and its format is used by a lot of other databases (MemSQL, MariaDB, ...). It would be great if Google supported some tools to migrate data; I would like to spend my money on queries and Google storage, not on "services" like Alooma.