Azure-sdk-for-js: cosmos bulk executor for nodejs?

Created on 10 Dec 2019 · 12 comments · Source: Azure/azure-sdk-for-js

Is your feature request related to a problem? Please describe.
All the other NoSQL DBs I use have bulk insert/upsert capabilities. I was expecting the same from Cosmos.

Describe the solution you'd like
Either a bulk executor for Node.js or an extension to the API to take multiple documents in one API call.

Describe alternatives you've considered
As described here -- https://docs.microsoft.com/en-us/azure/cosmos-db/bulk-executor-overview
all we can do is issue single inserts asynchronously in bulk.

Thanks! Chad

Labels: Client, Cosmos, customer-reported

Most helpful comment

@chadbr @PatrickQuintal FYI we have support now for bulk operations at the REST API level. I'm going to be tracking adding bulk APIs to the SDK over at https://github.com/Azure/azure-sdk-for-js/issues/7479 where I have some early implementation ideas.

All 12 comments

I'd 100% like to see the ability to do "bulk" operations within the @azure/cosmos package.

Currently for windows/.NET, there is

  1. A tool:
     https://github.com/azure/azure-documentdb-datamigrationtool
  2. A nuget "extension" for another nuget package (this is halfway deprecated? I guess? It's pretty confusing):
     https://github.com/Azure/azure-cosmosdb-bulkexecutor-dotnet-getting-started
  3. A new Cosmos library for .NET, similar to what they did with Node.js:
     https://github.com/Azure/azure-cosmos-dotnet-v3
     It basically collects requests over a 2 second timespan, then executes them.

I believe all the above links just create a sproc on the database temporarily, then call it through the API with a chunked array of JSON documents.
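For illustration, here is a minimal sketch of that sproc pattern with @azure/cosmos. The sproc id ('bulkInsertSproc'), the partition key value, and chunkOfDocs are placeholders for this example, not anything the SDK ships:

// Illustrative stored procedure that creates every document in the given
// array. Real code would check createDocument's boolean return value and
// resume with a continuation token if the sproc runs out of time or RUs.
const sprocDef = {
  id: 'bulkInsertSproc', // placeholder name
  body: function (docs) {
    var ctx = getContext()
    var coll = ctx.getCollection()
    var created = 0
    docs.forEach(function (doc) {
      coll.createDocument(coll.getSelfLink(), doc, function (err) {
        if (err) throw err
        created++
        if (created === docs.length) ctx.getResponse().setBody(created)
      })
    })
  }
}

await container.scripts.storedProcedures.create(sprocDef)

// sprocs run inside a single partition, so each chunk must share one key
const { resource: count } = await container.scripts
  .storedProcedure('bulkInsertSproc')
  .execute('somePartitionKeyValue', [chunkOfDocs])
console.log(`created ${count} docs`)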

It'd be nice for that same functionality to be in @azure/cosmos.

@chadbr @PatrickQuintal Thanks for opening the issue. This is on my radar, but I don't have a firm ETA yet. We are still evaluating the best way to do this in JS. Can both of you expand on your use case for this feature?

@chadbr you mention collecting and executing data during a set time period (2s). Would this be useful outside of a migration scenario? It would be hard to surface errors to the calling code if this was being used inside something like a REST API and would set a latency floor for the application.

For migrations, I highly recommend sticking with the data migration tool linked above or using Azure Data Factory. They are specifically designed for this use case.

As @chadbr mentioned, this is possible with SPROCs. That is essentially what happens under the hood in the bulk executor library today. But the benefits are a little dependent on how many partitions you have and how evenly your writes are spread out.

If you are just looking for an API method, that is something I could see adding now. For example, here is how I would do bulk today:

/* const client = create client omitted */
const container = client.database('production').container('todos')

async function bulkCreate(todos) {
  // kick off one create request per item, then wait for all of them
  const creates = []
  for (const todo of todos) {
    creates.push(container.items.create(todo))
  }
  return Promise.all(creates)
}

// note: Cosmos DB item ids must be strings
await bulkCreate([{ id: '1' }, { id: '2' }, { id: '3' }])

We could add a new API that does the same thing as the above:

container.items.bulkCreate([{ id: '1' }, { id: '2' }, { id: '3' }])

@southpolesteve

My current use case is,
I've got an on-prem MongoDB that is pretty "legacy" (in the sense of poorly designed). I'm looking to migrate this to the Cosmos DB SQL API. In between, I want to do some data transformation to utilize Cosmos DB features.

  1. I run a pretty non-Windows development environment, so it's a bit annoying to have to crack out the Windows VM to develop.
  2. Data Factory isn't a silver bullet for my use case because of my need to do data transformation, which isn't possible for MongoDB > Cosmos DB SQL. It's also a bit hard to manage within a team.
  3. I need to do data transformation, so there is something fishy about applying strongly typed notations (using the Cosmos DB .NET SDK) to both input and output when they are both dynamic.

I should add, our entire development environment is Node.js, so it's a bit weird to have everything in Node.js and then have to write some C# to perform data transformation.

That being said, there's no reason we should conflate the two features.

I see two features: the bulk execution (for migration purposes), and a "helper" API to create & use a sproc for bulk inserts, upserts, and deletes within a partition.

Out of the two, I'd personally prefer to see the latter. The 2s bulk executor is very.... niche.

@PatrickQuintal Got it. I can certainly appreciate not wanting to run a VM for Cosmos. I do most of my dev on non-Windows too.

Can you clarify the exact problem this would solve? Are you not able to move data fast enough from Mongo to Cosmos? Roughly what speeds are you seeing now and what is the total data size?

@southpolesteve

I guess my scenarios are a little different. I'm not concerned so much with migrations (although it's certainly a valid use case).

We have a UI where a user will be pasting 1000's of 'rows' into a grid and we're pushing these rows to the server as array chunks to reduce network traffic.
On the server (Node.js) we validate and then forward the arrays to the database for 'bulk upsert'.

It's very common for the user to paste data in columns -- i.e. user pastes 10,000 rows of one column resulting in 10k inserts, then pastes 10,000 rows into another column resulting in 10k updates on the same rows.

Doing this type of IO as individual calls (even with Promise.all) is very, very slow (unusable).

The benefits of array upserts are huge:
- removes executing 1000's of promises in Node
- reduces 1000's of network calls from the server to Cosmos to 10's
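
As a rough workaround sketch in the meantime, the array can at least be chunked so Node never holds more than a fixed number of in-flight requests (the chunk size of 100 is a guess to tune against your RU budget, and rowsFromGrid is a placeholder):

// Upsert a large array of rows in fixed-size chunks.
async function bulkUpsert(container, rows, chunkSize = 100) {
  const results = []
  for (let i = 0; i < rows.length; i += chunkSize) {
    const chunk = rows.slice(i, i + chunkSize)
    // allSettled (Node 12.9+) reports failures per row instead of
    // rejecting the whole chunk
    const settled = await Promise.allSettled(
      chunk.map((row) => container.items.upsert(row))
    )
    results.push(...settled)
  }
  return results
}

await bulkUpsert(container, rowsFromGrid)

Note that this still makes one network call per row, so it only bounds concurrency; it doesn't deliver the tens-of-calls reduction described above.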

--
MongoDB has insertMany, updateMany - https://docs.mongodb.com/manual/reference/method/db.collection.insertMany/

Google Datastore allows you to simply pass an array of {key, item}[]

--

We can use the sproc updates (we've had to do these in SQL databases for years...), but it seems like something that should be part of the API (like all the other DBs...)

--

Thanks, Chad

@chadbr Thanks for the info. It is super helpful to have the use cases laid out when I bring this back to the team.

I need to sync with the rest of the SDK folks on how we'll tackle this. .NET recently added a new bulk API that doesn't use sprocs, but I'm not sure yet how easily we can port it to JS.

It's not ideal, but in the meantime you can always use sprocs to do this yourself.


@southpolesteve
I suppose I don't really have a problem now since I'm just planning to do some sproc stuff.

At the time it would have been nice if I could have just written

while (stuff) { await container.bulkInsert(array, options); }

and had the package handle making the sproc and using it, instead of fudging around with a bunch of different things.
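
A hypothetical wrapper along those lines (bulkInsert is not a real @azure/cosmos method; this assumes a bulk-insert sproc like the one sketched earlier in the thread has already been registered):

// Hypothetical helper: feed the (assumed) bulk-insert sproc one chunk at a time.
async function bulkInsert(container, docs, { partitionKey, chunkSize = 100 } = {}) {
  for (let i = 0; i < docs.length; i += chunkSize) {
    await container.scripts
      .storedProcedure('bulkInsertSproc') // assumed to already exist
      .execute(partitionKey, [docs.slice(i, i + chunkSize)])
  }
}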

Perhaps even just a "this is how you move data in bulk programmatically with Cosmos DB and Node.js" guide.

Currently all the MS samples point you to about 3-4 different things that realistically would never help anyone.

(I'm looking at you, Mr Azure Document that tells people it's okay to just migrate your MongoDB to Cosmos DB without any changes.)

I think @PatrickQuintal is saying essentially the same thing I am... Bulk insert/upsert API.

@chadbr @PatrickQuintal FYI we have support now for bulk operations at the REST API level. I'm going to be tracking adding bulk APIs to the SDK over at https://github.com/Azure/azure-sdk-for-js/issues/7479 where I have some early implementation ideas.

Awesome!
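
For anyone landing here later, a sketch of what a bulk call through the SDK might look like once those APIs ship, mirroring the REST-level bulk operations mentioned above (treat the method name and operation shapes as assumptions until the linked issue is resolved):

// Sketch only: assumes the SDK exposes something like items.bulk;
// verify names and shapes against the released API.
const operations = [
  { operationType: 'Create', resourceBody: { id: '1', status: 'open' } },
  { operationType: 'Upsert', resourceBody: { id: '2', status: 'done' } },
  { operationType: 'Delete', id: '3', partitionKey: '3' },
]

const responses = await container.items.bulk(operations)
// each entry carries its own status code, so per-item failures are visible
for (const r of responses) console.log(r.statusCode)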
