Bulk insert into a table with PK is significantly faster if input data is in order (no need to write to tempdb first and run a sort). Need to add the ORDER(...) hint to the tdscommand string.
It looks like this has been on Connect since November 2007: https://connect.microsoft.com/SQLServer/feedback/details/308741/sqlbulkcopy-class-does-not-support-hints
Would this just be another case over here, but with maybe another property on SqlBulkCopy that has the extra info needed to be able to do this? Maybe it would actually be some sort of collection of (column name, sort direction) pairs, and if it's non-empty, then SqlBulkCopy propagates that as a proper hint (which means another condition up here for entering this block)?
It just feels really arbitrary for me to sit here measuring the relative characteristics of two presumably inefficient solutions, one based on SqlBulkCopy (with the tempdb + sort on the DB server side) and the other writing all the data out to temp files on disk and pushing them through the bcp command-line utility with an ORDER hint.
Like, I'm almost considering just implementing it myself and submitting a pull request, but I'm not really looking forward to going through the (completely justified) processes it looks like I'd have to go through to add new stuff to the public API, which, admittedly, is probably a big part of why this has been sitting in Connect since 2007. Then again, all things considered, it's entirely possible that by going through this exercise, I would potentially save my employer more money than the total cost of a less efficient solution.
I actually ended up implementing it for .Net 4 by scavenging the somewhat newer dotnet source on github by adding a string property which corresponds to the ORDER hint of the BULK INSERT statement of SQL Server. The syntax here is
ORDER ( { column [ ASC | DESC ] } [ ,...n ] )
so having a collection of columns is a bit overkill, a single string will do. I would implement it and make a pull request but setting up the whole build and test environment just for this issue is simply too much time. Maybe there's a VM image for .Net development?
The bcp command-line tool also support specifying the ORDER hint in the -h parameter:
https://msdn.microsoft.com/en-us/library/ms162802.aspx
having a collection of columns is a bit overkill, a single string will do.
The main reason why I said collection of (column name, direction) instead of a single string has to do with usability. I had CollectionView.SortDescriptions in mind when I mentioned that, so there's also a precedent for this pattern.
The bcp command-line tool also support specifying the ORDER hint in the -h parameter
Yes, this is what I was referring to in my comment above. The problem with that is that we don't have the data files on disk for the command-line utility to use them (I'm actually looking into writing a fancy IDataReader
implementation backed by a Stream
which stores the data in a much more compact way; we also have some code that uses Stream
s from OPC packages), so we'd have to do the extra step of copying the data to disk just to make the command-line utility accept it.
I realize that some of this is already known by the participants here, but I think it best to be as clear as possible: the goal here is to get the insert bulk
command to have ORDER(...)
in the with
clause. Whether a collection for columns and directions or a string, it doesn't make much sense to have as a SqlBulkCopyOption
since it requires additional, non-standardizable (is that a word?) info. It would need to get passed into an overload of the constructor, and I suppose a collection would help prevent incorrect formatting, hence greater chance of success. The end result should be that the insert bulk
command submitted via TDS ends with something like the following (captured in a SQL Server Profiler trace):
with(TABLOCK,ORDER([object_id] ASC),KEEP_NULLS)
I certainly hope that nobody is seriously considering writing any data to disk and then calling the BCP command (which is not guaranteed to exist on any system) :wink:. I am not sure if there is a technical reason why this option has not been implemented by Microsoft as it _appears_ to be fairly simple, but it could be either not knowing how to best handle accepting the list of columns and directions, and/or perhaps there is a ramification to sending in the ORDER
hint and _not_ guaranteeing that the data is truly in that order (though according to the MSDN page for BCP, SQL Server validates the ordering anyway).
...OR...
OR, something that just occurred to me is:
If the goal is to provide a user-friendly mechanism, then perhaps a new SqlBulkCopyOption
could be used (i.e. SqlBulkCopyOptions.Order
or possibly SqlBulkCopyOptions.Ordered
). Regarding the ORDER
hint, the documentation for the BCP utility states:
If the data file is sorted in a different order, that is other than the order of a clustered index key, or if there is no clustered index on the table, the ORDER clause is ignored.
Meaning, the column names and directions to be supplied for this hint are not really free-form. There is only one definition, per each table, that is valid. Hence, the user doesn't truly _need_ to supply that column list as it is known by SQL Server. Given this requirement _and_ the fact that Clustered Indexes are always index_id = 1
, the column list can be auto-discovered prior to issuing the TDS insert bulk
command, using the connection info and destination table name provided. And this can be handled somewhere post call to WriteToServer
. The following query can be used to get the Clustered Index column and direction specification (and no rows returned indicates that there is no Clustered Index for the given table):
SELECT sc.[name],
CASE sic.[is_descending_key] WHEN 1 THEN 'DESC' ELSE 'ASC' END AS [Direction]
FROM [sys].[objects] so
INNER JOIN [sys].[index_columns] sic
ON sic.[object_id] = so.[object_id]
INNER JOIN [sys].[columns] sc
ON sc.[object_id] = so.[object_id]
AND sc.[column_id] = sic.[column_id]
WHERE sic.[index_id] = 1
AND so.[name] = @TableName -- sysname (i.e. NVARCHAR(128) )
ORDER BY sic.[key_ordinal] ASC;
As it relates to my suggestion (directly above) to use a new SqlBulkCopyOptions
enum value to query SQL Server for the key names and directions, I have just tested some imports via the BCP utility and confirmed that having the user provide the specific Clustered Index specification for the ORDER
hint is useless as there is only one valid value, and that value is easily discoverable:
ORDER()
specification does not match the Clustered Index, there will be a sort operation as indicated by sort_init
showing up in the trace, regardless of the order of the data in the file.ORDER()
specification matches the Clustered Index, _and_ the data is _not_ actually in order, the process will error upon encountering the first out-of-sequence row, similar to the following (PK value of 17 was line 3 of the file):SQLState = 37000, NativeError = 4819
Error = [Microsoft][SQL Server Native Client 11.0][SQL Server]Cannot bulk load. The bulk data stream was incorrectly specified as sorted or the data violates a uniqueness constraint imposed by the target table. Sort order incorrect for the following two rows: primary key of first row: (17), primary key of second row: (6).
- if the
ORDER()
specification matches the Clustered Index, _and_ the data _is_ in order, then there is no sort operation, as indicated by the absence of anysort_init
events (EventClass =SQLTransaction
). Woo hoo!
I support srutzky's idea about discovering the primary key automatically. Matching columns could be a bit tricky, though, if columns are manually mapped and are not in order. Another question is whether it fits into the standard how .Net classes should work and how unit tests for these types of operations should be implemented.
I certainly hope that nobody is seriously considering writing any data to disk and then calling the BCP command (which is not guaranteed to exist on any system) :wink:.
If it's only going to be run on environments we control, we can wave our magic wand and say that the BCP utility is guaranteed to be there. The context of this issue, and to a lesser extent bulk import tools in general, is that we're already in a situation where we're trying to maximize performance, and we're willing to trade off some of those other things that make our lives easier for general development.
There are some situations where it can theoretically be faster than the alternative to do the unintuitive thing where we make an extra copy of the data to the local disk wherever the application is running. If the database is sitting on a shared server that's severely overloaded, and the application is running on a machine with resources it can dedicate to just this operation, then it can _easily_ be faster to copy this stream of data to a disk with fast sequential write access, then use the BCP utility to copy the data straight into the destination table to minimize resource usage on the destination server. The alternative (SqlBulkCopy without ORDER(...)
hint) seems to be to put all the data into tempdb, then have the overloaded server copy it into the destination table and "sort it" on-the-fly. In this case, all the data needs to be read in its entirety twice anyway, but one of the reads is happening somewhere undesirable.
In fact, even if the database server isn't overloaded, it might still be shared by multiple applications, and it's just plain polite not to use more of those resources than we need to (scalability, blah blah). If it's acceptable for the application to hammer its local disk by copying all the data, and we can assume that the application's local disk has capacity for it (invoking the "magic wand" rule again), I'm going to at least consider that solution.
I haven't seriously measured anything as of this moment, so I could indeed be completely off-base with this. Maybe I'm not in a situation where the ORDER(...)
hint is worth copying to the local disk. I don't, however, need measurements to feel comfortable saying that being able to use the ORDER(...)
hint with SqlBulkCopy would be ideal, and that having it available would save me from doing measurements to figure out which sub-optimal solution is the least bad.
_I don't have anything useful to say about the idea to auto-discover parameters to ORDER(...)
. It's ultimately not something that I, as a consumer of this API, care enough about to make a fuss over._
One drawback SqlBulkCopy is that it's managed and it and the data reader behind it do a lot of boxing-unboxing behind the scenes. That's fine but it all happens on a single thread. This is one of the reasons why BCP can be faster. But this has nothing to do with the suggested improvement to the SqlBulkCopy implementation.
@airbreather
I think one of us is misunderstanding the other. Are you suggesting that the implementation of this particular issue should be to first write the data to disk? Or that writing to disk to call BCP is the alternative because this particular issue hasn't been implemented?
There is no disagreement regarding:
being able to use the ORDER(...) hint with SqlBulkCopy would be ideal
as that is the entire nature of this particular issue. I was just trying to make it clear that the purpose of this issue is to avoid copying anything to disk.
As a side note, while it _may_ be faster to write to disk first and then bulk load (vs SqlBulkCopy
without the ORDER
hint), I would use BULK INSERT before calling BCP. Still, SQL Server does a lot of internal caching. It definitely would be an interesting test to see which method truly is faster.
@dobos
Matching columns could be a bit tricky, though, if columns are manually mapped and are not in order.
Why would this be an issue? You specify the ORDER
hint in terms of the destination table columns, not in terms of the source file columns. It shouldn't matter how the columns are mapped as long as the data in the source column(s) that map to the key column(s) are in the expected order.
Are you suggesting that the implementation of this particular issue should be to first write the data to disk? Or that writing to disk to call BCP is the alternative because this particular issue hasn't been implemented?
The latter (workaround because we don't have this right now). Sorry for any confusion I may have caused.
I'm working on a tool that is heavily invested in bulk copying terabytes of data, and deferring to bcp
is vastly less efficient/impractical because of the requirement of streaming everything to disk first. The lack of support for an ORDER
hint hurts me a lot, because the intermediate tempdb sorts measurably slow down things. The obvious alternative is SSIS, but Integration Services has its own problems that don't make it a happy match for every scenario. In short, I'd like to see this in the framework (and I'd happily build it myself).
Personally, I would be content with a simple string OrderHint
property on the class, to be set before calling .WriteToServer()
, with no syntax checking or parsing on this property, which is passed as ORDER (<yourhinthere>)
in the INSERT BULK
command. Rationale: it's easy to implement, it's a quick win and it matches how Integration Services and bcp
do things (porting existing invocations is not an explicit design goal of SqlBulkCopy
, but it does currently serve as a fairly naive wrapper around INSERT BULK
).
Extending SqlBulkCopy
to go the extra mile and reach out to the server to grok the destination structure, per @srutzky , is much more work to do portably -- it would ideally also take into accounts such things as the fact that INSERT BULK
can target a view, so you'd have to dip down to the underlying table for the details.SqlBulkCopy
already puts in work to determine the table structure, using SET FMTONLY ON
and sp_tablecollations_100
-- the former being deprecated and the latter undocumented, but hey, it's all Microsoft. :-P Given this, determining the clustered index order wouldn't be much out of line, although that's still a bit more work than what it's doing now.
Even so, I would prefer to leave such niceties to a higher level, which can also fix the other deficiencies in SqlBulkCopy
that would be impossible to do retroactively for breaking backwards compat -- things like the fact that it heavily steers the naive developer towards DataTable
, doesn't set default options for max performance, etc. I would rather see SqlBulkCopy
offer the necessary baseline to make such things possible in the first place.
Does this need a full API writeup/proposal somewhere?
Hi @jeroen-mostert . Yes, dynamically discovering the clustered index column order is more work, but not that much given what the code is already doing. It could be as simple as adding the query noted in my second comment above as a new, 4th query in the TDSCommand
string. It would be done just prior to the following line:
And it will be executed at the same time as the other 3 queries in the "initial query", hence no extra objects or executions, etc.
Then, in the AnalyzeTargetAndCreateUpdateBulkCommand
method, the results of that query will be available in the internalResults
variable that was passed into the method. It would be used in a new if
block for a new enum value SqlBulkCopyOptions.Ordered
in the following if
block:
And that's pretty much it, outside of I guess needing an additional constant for the result set id in the collection, after the following lines:
something along the lines of:
private const int ClusteredIndexResultId = 3;
In the end, it doesn't really make sense to do this in any other level because the whole point is that the user shouldn't be forced to pass in this info in the first place. It is a static value, much like a password. If you pass in the correct value, then goodness else no goodness. So why pass in a value at all if the server already has the info? From a technical perspective, the ORDER (...)
option of the INSERT BULK WITH (...)
command requires that the correct column list be passed in, so we have to meet that requirement, but there is no reason for anyone to pass in that value to this code, or even to BCP.exe (outside of checking to see who is paying attention to their clustered index key columns).
That being said, this proposal does not necessarily conflict with the request to have a string OrderHint
property that populates the same ORDER ()
option. It could even be that OrderHint
, if it has a value, is used as is, else the value is auto-populated as I suggested if SqlBulkCopyOptions.Ordered
was passed in.
Also, regarding SET FMTONLY ON
and sp_tablecollations_100
:
SET FMTONLY ON
with sys.sp_describe_first_result_set, but then they would need an additional condition to handle SQL Server 2005, 2008, and 2008 R2. Chances are, SET FMTONLY ON
will never be removed.sp_tablecollations_100
proc. That's just internal code that we shouldn't use, but it likely exists for just this purpose.In the end, it doesn't really make sense to do this in any other level because the whole point is that the user shouldn't be forced to pass in this info in the first place. It is a static value, much like a password.
This is true only if we're allowed to assume the clustered index doesn't change. Imagine for a moment it did change: code that passes an ORDER
hint will not fail if it actually passes the rows in the order specified -- it will incur a sort step, but that's a performance regression, not a functional regression. (It could be really bad to the point where it effectively is a functional regression, depending on the scenario, but we don't know.)
On the other hand, if it does change, and code is not aware of it, then the bulk insert will just fail at the first batch (or the end, if all rows are inserted in one batch) with an ordering error. This is not necessarily a deal breaker if this behavior is clearly documented with the option, but I'm not sure if I'd want this, myself. Ultimately, the way the INSERT BULK
command implements this is the real clumsiness, but I would err on the side of caution and give SqlBulkCopy
the ability to reflect the same clumsiness, for better or worse. It already goes out of its way to be just as clumsy as a native bulk copy (how about the fact that you're more or less required to set the ColumnMapping
, lest you get the completely unhelpful ordinal mapping)?
(Also, as an aside, I don't know what, if anything, this option does when inserting in a heap table. I expect "nothing", except for the error if you violate the order, but I'm too lazy to do a thorough test at the moment. I could imagine a heap table insert always being "order not 100% guaranteed" without the hint, rather than "physical order of rows supplied, whatever it is". The fact that this question can be asked in the first place is another reason I'd prefer not to get too clever.)
regarding
SET FMTONLY ON
andsp_tablecollations_100
:
Yes, I know. The smiley was just a rueful way of remarking that if MS was willing to go the extra mile to fetch some metadata by internal and convenient means, they could have actually made SqlBulkCopy
better in the first place. The current behavior (see also: ordinal mappings) I would really only find excusable if metadata discovery wasn't in there at all. (There's precedent for that, viz. not determining the structure of TVPs passed in a SqlCommand
, with the accompanying risk of conversion errors.)
Splitting this off to its own comment because editing of the previous one got out of hand; that should be stable now. :-)
That being said, this proposal does not necessarily conflict with the request to have a
string OrderHint
property that populates the sameORDER ()
option. It could even be thatOrderHint
, if it has a value, is used as is, else the value is auto-populated as I suggested ifSqlBulkCopyOptions.Ordered
was passed in.
The order can't be determined until .WriteToServer
is called, when we know the .DestinationTableName
(the weird inconsistency between non-essential options set in the constructor and essential options being set in properties after construction is another peeve of mine -- but I digress). The property OrderHint
could only be populated after the copy was done (overwriting only on an empty value), which I don't think is worth exposing. You'd end up with an API where it's a really good idea to set the OrderHint
yourself, unless you don't, but you do use SqlBulkCopyOptions.Ordered
, when we will set the hint for you, and have mercy on your soul if you didn't already order your data the way you somehow already knew it had to be ordered. It almost starts making more sense to have a dedicated class for discovering the clindex metadata and building the hint for you, but then we're back to something that I'm not sure the framework absolutely needs to have.
Guys, I don't want to sound unprofessional, but I think it's overdiscussed. Someone who optimizes bulk inserts with this class is not an average user but a skilled software engineer for whom being able to specify the hint as a string is a perfect solution.
being able to specify the hint as a string is a perfect solution.
Commenting to add my support / rephrase / elaborate: I care a lot about making this possible through the SqlBulkCopy
managed API at all, and so I would be more than willing to live with a slightly less usable solution than the ideal, if it can help lead to having something rather than nothing sooner.
I'm not using SqlBulkCopy
without knowing my data or where it's going. Rather, I'm using SqlBulkCopy
because I know my data, where it's going, and why I shouldn't instead be using a bunch of INSERT
statements to get it there.
I'm looking to specify this hint because I know that it's in the correct order; I only know this because I put it in that order (or I trust the data provider to have done so). If anything, it may perhaps be better for me to need to specify the order that the data is sorted in, since that's all that I can actually reasonably promise.
Also, really small side note, I very slightly prefer a solution like:
sqlBulkCopy.OrderHintColumns.Add("Column1", Order.Descending);
sqlBulkCopy.OrderHintColumns.Add("Column2", Order.Ascending);
sqlBulkCopy.OrderHintColumns.Add("Column3", Order.Descending);
over a solution like:
sqlBulkCopy.OrderHint = "[Column1] DESC, [Column2] ASC, [Column3] DESC";
Again, though, that's really minor.
No rush, this hasn't been in the framework for 13 years (counting from when SqlBulkCopy
was introduced in .NET 2), and bulk copying hasn't changed in importance -- some time to get the interface right is fine. The "let's have this in the framework soon" ship has definitely sailed, and it (probably) wasn't because of implementation discussions that ran into trouble. Personally I'm all for the simple implementation because it's good enough for me, but I'm not the maintainer.
In order to actually get this into the FW, a separate API proposal is actually necessary anyway (even if it's just one property, it's a change to the public API). Regardless of the implementation, writing the code isn't the hard part. I know because I did it in about 10 minutes (for the straightforward OrderHint
addition). [Edit: don't actually use this, though, it has a silly bug. The point that the implementation won't take the most time stands... that'd be testing, as always.] Setting up things so you can build CoreFX is of course a bit more work. My plan was never to wait for the next version of the framework for this support anyway -- I need it sooner too. 馃槂
being able to specify the hint as a string is a perfect solution.
and
I would be more than willing to live with a slightly less usable solution than the ideal, if it can help lead to having _something_ rather than _nothing_ sooner.
Yes, I agree with all of that. I was just making a suggestion in a vain attempt at working around/covering up a possibly clunky interface in the form of the ORDER()
option. But it was in no way an attempt to force an additional requirement for this request. If the general consensus is that it be either OrderHint
or OrderHintColumns
, then I, too, would be quite happy to have that as opposed to no ability to specify ORDER()
.
I鈥檓 excited about the possibility of SqlBulkCopy supporting ordered inserts!
However, the idea of passing a raw order specification string into the query makes me hesitate. This opens up a vector for a SQL injection attach. It _also_ opens the door for developers to purposefully hack this property to inject other hints/otherwise change the query being executed (imagine a developer using an ordering specification string of [object id] ASC),SomeOtherHint) --
).
Something like the following guards against the above:
sqlBulkCopy.OrderHintColumns.Add("object id", Order.Descending);
the idea of passing a raw order specification string into the query ... opens up a vector for a SQL injection attach.
...
[using theOrderHintColumns
collection suggestion from @airbreather ] guards against that.
True. Good point. Specifying each column separately allows for easily applying the same escaping pattern used in other places in this class to each entry, for example:
I propose we use the existing SortOrder
enumeration for this, also conveniently living in System.Data.SqlClient
. The .Add
would need to throw if you try to add a column with SortOrder.Unspecified
, since the hint doesn't support that (obviously), but that minor niggle isn't worth creating a brand new enum with almost the same purpose, IMO.
That just leaves "boring" questions like whether the new SqlBulkCopyOrderHintColumnCollection
should be a non-generic collection inheriting from CollectionBase
, for consistency with SqlBulkCopyColumnMapping
, or whether we just go the extra mile and make it a strongly-typed list of something... I don't know if there's policy for this. I'm OK with a non-generic collection; the API surface would look the same, and the consistency is more compelling than making just this one auxiliary collection "neat".
I propose we use the existing
SortOrder
enumeration for this, also conveniently living inSystem.Data.SqlClient
. The.Add
would need to throw if you try to add a column withSortOrder.Unspecified
, since the hint doesn't support that (obviously),
Why do you say that SortOrder.Unspecified
is not supported? Shouldn't that option merely not add either ASC
or DESC
after the column name? In which case that is definitely supported by the ORDER()
hint. The documentation for INSERT
shows the direction as optional:
[ [ , ] ORDER ( { column [ ASC | DESC ] } [ ,...n ] ) ]
And I just tested it and it works:
BCP tempdb.dbo.OrderedImportTest IN .\OrderedImportTest.txt -t "|" -S (local) -T -k -c -h "TABLOCK,ORDER([Col1])"
I then tested an out-of-order file:
BCP tempdb.dbo.OrderedImportTest IN .\OrderedImportTest_WrongOrder.txt -t "|" -S (local) -T -k -c -h "TABLOCK,ORDER([Col1])"
and, as expected, it failed:
SQLState = 37000, NativeError = 4819
Error = [Microsoft][SQL Server Native Client 11.0][SQL Server]Cannot bulk load. The bulk data stream was incorrectly specified as sorted or the data violates a uniqueness constraint imposed by the target table. Sort order incorrect for the following two rows: primary key of first row: (3), primary key of second row: (2).BCP copy in failed
It also failed if I added ASC
after [Col1]
. But, it succeeded if I added DESC
after [Col1]
because that was the incorrect spec and so the ORDER()
hint was ignored.
Yes, leaving out ASC
/ DESC
is supported in the sense that if you don't specify it, the default is ASC
(just as in a regular ORDER BY
clause -- I verified this is the case, and that it's not actually taking the ASC
/ DESC
order from the index). But SortOrder.Unspecified
actually has a different meaning, namely that the order is unspecified (as in, the natural state of columns). It's used this way in SqlMetaData
. The semantically correct thing to do would be to skip columns with SortOrder.Unspecified
and not add them to the ORDER
hint at all -- but that's weird. (There could be some very limited value in this for automatic generators, I suppose, but only lazy ones.)
We could reappropriate Unspecified
in this case to support the "use case" of specifying the ORDER
hint in a particular way -- but from an API perspective, I'd prefer it if people were forced to be explicit about what they mean. Allowing Unspecified
gains you nothing new in terms of functionality and is potentially confusing in meaning. In effect, it's two values that mean the same thing. (I'm willing to discount the possibility that MS will ever change the semantics of the ORDER
hint so specifying nothing actually means something different to ASC
or DESC
.)
That said, I'm not strongly attached to this, and it would save an exception.
As recently announced in the .NET Blog, focus on new SqlClient features an improvements is moving to the new Microsoft.Data.SqlClient package. For this reason, we are moving this issue to the new repo at https://github.com/dotnet/SqlClient. We will still use https://github.com/dotnet/corefx to track issues on other providers like System.Data.Odbc and System.Data.OleDB, and general ADO.NET and .NET data access issues.
Duplicate of #23.
Will close this issue as we're using #23 to track this request.
Most helpful comment
As it relates to my suggestion (directly above) to use a new
SqlBulkCopyOptions
enum value to query SQL Server for the key names and directions, I have just tested some imports via the BCP utility and confirmed that having the user provide the specific Clustered Index specification for theORDER
hint is useless as there is only one valid value, and that value is easily discoverable:ORDER()
specification does not match the Clustered Index, there will be a sort operation as indicated bysort_init
showing up in the trace, regardless of the order of the data in the file.ORDER()
specification matches the Clustered Index, _and_ the data is _not_ actually in order, the process will error upon encountering the first out-of-sequence row, similar to the following (PK value of 17 was line 3 of the file):