Hello,
I want to upload data to an existing table in BigQuery. I looked in the docs and found bigquery.insertAll, which works fine, but my code takes very long to execute. I think it's because I upload my data row by row and every row takes about a second. In the docs I read about 10,000 rows per second, but I don't see how to change my code to achieve that.
Is there a smart and, above all, fast way to upload data into BigQuery? I considered collecting my data in ArrayLists or JSON objects and uploading it in one step, but I couldn't find a way to do that.
Here is my code to show what I did:
public static void uploaddata(String datasetname) throws IOException {
    BigQuery bigquery = BigQueryOptions
            .getDefaultInstance()
            .toBuilder()
            .setProjectId("testprojekt-175408")
            .build()
            .getService();
    TableId tableIdor = TableId.of(datasetname, "review_test");
    String csvFile = "C:/Users/Marku/Desktop/testfile2.csv";
    BufferedReader br = null;
    FileReader myFile = null;
    String line = "";
    String cvsSplitBy = ";";
    String[] beschriftung = null;
    int i = 0;
    Map<String, Object> rowContent = new HashMap<>();
    myFile = new FileReader(csvFile);
    br = new BufferedReader(myFile);
    // read CSV file line by line and upload it into BigQuery
    while ((line = br.readLine()) != null) {
        // get the name of the fields from the first row of the CSV File
        if (i == 0) {
            beschriftung = line.split(cvsSplitBy);
            i = i + 1;
            for (int e = 0; e < beschriftung.length; e++) {
                rowContent.put(beschriftung[e], "init");
            }
        } else {
            // Format Data for BigQuery and upload the row
            String[] Zeile = line.split(cvsSplitBy);
            for (int e = 0; e < Zeile.length; e++) {
                rowContent.put(beschriftung[e], Zeile[e]);
            }
            i = i + 1;
        }
        InsertAllResponse response = bigquery
                .insertAll(InsertAllRequest
                        .newBuilder(tableIdor)
                        .addRow(String.valueOf(i), rowContent)
                        .build());
        if (response.hasErrors()) {
            System.out.println(response.getErrorsFor(0));
        }
    }
    br.close();
    myFile.close();
}
}
As you can see, I get my data from a CSV file. There are several reasons why I don't upload just the CSV with a load job:
Thanks in advance for your help.
For the best performance when streaming data into BigQuery using Table.insertAll, it is recommended that you batch multiple rows (up to 10,000, though 500 is recommended) in the same request. From the BigQuery quotas page:
Maximum rows per request: 10,000 rows per request. We recommend a maximum of 500 rows. Batching can increase performance and throughput to a point, but at the cost of per-request latency. Too few rows per request and the overhead of each request can make ingestion inefficient. Too many rows per request and the throughput may drop. We recommend using about 500 rows per request, but experimentation with representative data (schema and data sizes) will help you determine the ideal batch size.
If you have a CSV file, it's best to use a load job. BigQuery has an option called writeDisposition, which when set to WRITE_APPEND, will append data to a table if the table already exists.
Thank you very much for your answer and for pointing out that there is the writeDisposition option in BigQuery.
Of course, I can use a load job to upload my CSV file, but as far as I know my CSV file must then have a certain format. For example, I must format a TIMESTAMP column in a way BigQuery can understand. Otherwise my load job would fail.
Since we use a completely different TIMESTAMP format in Germany, this would make an upload very inconvenient.
Therefore, I want to change my CSV file row by row and then upload it. Do I need to create a temporary file in order to upload it with a load job?
What do you mean by batch multiple rows?
Does my code have to look like this to do so?
InsertAllResponse response = bigquery
.insertAll(InsertAllRequest
.newBuilder(tableIdor)
.addRow(rowContent)
.addRow(rowContent)
.addRow(rowContent)
.addRow(rowContent)
.addRow(rowContent)
.addRow(rowContent)
.
.
.
.build());
I think this would be a very bad way to write code.
Regarding the load job method, you don't necessarily need to write to a temporary file. This sample shows how to make a load job using in-memory CSV data: https://github.com/GoogleCloudPlatform/google-cloud-java/blob/8f7792580e9e3de9d3b8e1fb5f61f9dba413a660/google-cloud-examples/src/main/java/com/google/cloud/examples/bigquery/snippets/BigQuerySnippets.java#L328-L353
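To illustrate the "no temporary file" idea for the timestamp problem mentioned above, here is a minimal sketch that rewrites the CSV in memory before it is handed to a load job. It uses only the JDK. The class name CsvTimestampRewriter, the column index, and the German input pattern dd.MM.yyyy HH:mm:ss are assumptions for illustration (the thread does not state the exact format), and the plain split does not handle quoted fields.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the BigQuery client library.
class CsvTimestampRewriter {
    // Assumed input format, e.g. "24.12.2017 18:30:00".
    private static final DateTimeFormatter GERMAN =
            DateTimeFormatter.ofPattern("dd.MM.uuuu HH:mm:ss");
    // "YYYY-MM-DD HH:MM:SS", a form BigQuery accepts for TIMESTAMP columns.
    private static final DateTimeFormatter BIGQUERY =
            DateTimeFormatter.ofPattern("uuuu-MM-dd HH:mm:ss");

    /** Rewrites the timestamp in one data line and leaves the other fields untouched. */
    static String rewriteLine(String line, String separator, int timestampColumn) {
        // Pattern.quote keeps separators such as "|" from being read as regex syntax;
        // the -1 limit preserves trailing empty fields.
        String[] fields = line.split(Pattern.quote(separator), -1);
        LocalDateTime parsed = LocalDateTime.parse(fields[timestampColumn].trim(), GERMAN);
        fields[timestampColumn] = parsed.format(BIGQUERY);
        return String.join(separator, fields);
    }

    /** Builds the whole CSV in memory; the first line is treated as the header. */
    static String rewriteCsv(Iterable<String> lines, String separator, int timestampColumn) {
        StringBuilder out = new StringBuilder();
        boolean header = true;
        for (String line : lines) {
            out.append(header ? line : rewriteLine(line, separator, timestampColumn));
            out.append('\n');
            header = false;
        }
        return out.toString();
    }
}
```

The resulting string can be turned into bytes and written to the load job's channel in the same way the linked sample writes its in-memory CSV data. The load job would also need to be told that the field delimiter is ";" rather than ",".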
If you wish to use the insertAll method, your code sample will work, but note that you can build the InsertAllRequest incrementally using InsertAllRequest.Builder.
InsertAllRequest.Builder request = InsertAllRequest.newBuilder(tableIdor);
for (int i = 0; i < someSize; i++) {
// Do some transformation to get rowContent
request = request.addRow(rowContent);
}
InsertAllResponse response = bigquery.insertAll(request.build());
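Combining this with the 500-row recommendation, the loop can flush a request every 500 rows instead of building one huge request. Below is a rough sketch of just the batching logic, written against the JDK only so it can be tried without credentials. RowBatcher and the sender callback are hypothetical names; in the real upload the callback would create an InsertAllRequest.Builder, call addRow for each map, and pass build() to bigquery.insertAll.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

// Hypothetical helper, not part of the BigQuery client library.
class RowBatcher {
    // Batch size recommended on the BigQuery quotas page quoted earlier.
    static final int BATCH_SIZE = 500;

    /**
     * Hands the rows to sender in chunks of at most batchSize rows.
     * Returns the number of batches that were sent.
     */
    static int sendInBatches(Iterable<Map<String, Object>> rows,
                             int batchSize,
                             Consumer<List<Map<String, Object>>> sender) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<Map<String, Object>> batch = new ArrayList<>(batchSize);
        int sent = 0;
        for (Map<String, Object> row : rows) {
            // Copy the map: the original code reuses a single rowContent instance,
            // so without a copy every entry in the batch would point at the same map.
            batch.add(new HashMap<>(row));
            if (batch.size() == batchSize) {
                sender.accept(batch);
                sent++;
                batch = new ArrayList<>(batchSize);
            }
        }
        // Send the final, partially filled batch.
        if (!batch.isEmpty()) {
            sender.accept(batch);
            sent++;
        }
        return sent;
    }
}
```

With 1,201 rows and a batch size of 500 this produces three requests (500, 500, and 201 rows) instead of 1,201 single-row requests. The best batch size for a given schema is worth measuring, as the quotas page suggests.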
Thanks for your fast answer. I will try your solutions.