Bulk inserting into database using ADO.NET... deadly slow?

  • Thread starter Thread starter Massimo
  • Start date Start date
M

Massimo

I have this SQL Server 2000 database, which I want to populate with data
generated from a program; the schema is quite simple (one table, seven
columns, each one one byte wide), but there are plenty of data: about 450
million rows, for an estimated total of ~3 GB.
So, I wrote my data-generating .NET application, and then tried to put data
into a DataSet and then storing it into the DB using a SqlDataAdapter... and
it was painful. My first guess was to store as much data as possible into
the DataSet, so to reduce the DB updates, but I found that when reaching
about 100K rows the DataSet becomes slower and slower, each insertion taking
more time than the preceding ones. So i resorted to update the DB more
often, and I tried doing it every 20000, every 10000 and every 5000 rows
added, but it always seemed to take a lot of time (i.e. more than one minute
to insert 20K rows).
Finally, I decided to avoid DataSets, and I simply wrote all of the program
output to a 6 GB CSV text file, and imported it into the DB using DTS; and,
wonderful... it started inserting 40-50K rows *per second*; after about five
hours the job was completed.
So, here's the question: why is it so painfully slow to insert big chunks of
data into a DB using DataSets? It's so slow that even a 6 GB text file
provides a *much* better throughput...

Massimo
 
Are you inserting into a database with indexes?
If you are, then this slow performance will happen.
Delete the indexes, insert the data, then create indexes.
 
It's the same problem if you tried pulliing a huge Yacht on the back of a
ferrari. Might be a great car, might be physically able to do , but
definitely the wrong tool for the Job. DTS and Bulk Insert are designed
specifically datatransfer. ADO.NET is designed primarily for disconnected
data access.. Besides, as much as I love ado.net, breaking out the old DTS
designer and getting down is always fun ;-)
 
It's the same problem if you tried pulliing a huge Yacht on the back of a
ferrari. Might be a great car, might be physically able to do , but
definitely the wrong tool for the Job. DTS and Bulk Insert are designed
specifically datatransfer. ADO.NET is designed primarily for disconnected
data access..

What would be the right way to add programmatically huge amounts of data to
a DB in a .NET application, then?

Massimo
 
Are you inserting into a database with indexes?
If you are, then this slow performance will happen.
Delete the indexes, insert the data, then create indexes.

I don't think this is the problem... the only index is the primary key (it's
needed), and anyway it's the same DB when accessing it via ADO.NET and when
using DTS.

Massimo
 
So, here's the question: why is it so painfully slow to insert big chunks of
data into a DB using DataSets? It's so slow that even a 6 GB text file
provides a *much* better throughput...

Because each inserted row using ADO/ADO.Net requires execution of a SQL
INSERT statement. Not only does it require a network roundtrip for each, it
also requires a SQL statement parse and an execution plan developed - some
performance improvement is realized if a cached copy is available, but even
that takes a bit of time.

If you are going to do this sort of thing on a regular basis, consider
switching to native mode programming and use ODBC bulk copy functionality.
here is a link:

http://msdn.microsoft.com/library/default.asp?url=/library/en-us/odbcsql/od_6_040_101f.asp

regards
roy fine
 
There are not many options, I would have tried the following:
(1) In the app select less records at a shot, 50K for instance, where you
know it is still fast.
(2) Write the data to a file and call the "BULK INSERT" statement from your
program (look in your SQL Server documentation for "BULK INSERT").
Compare the speed of the two and decide which one to use.
 
If you have SQL Server, you can create a dts package and a job (or a job
without DTS using any mechanism [just mentioned for other reference]) and
invoke it from .NET. But you can also use the scehduler on SQL Server which
is pretty darned powerful and do it all from there.
 
Hello Massimo,

It's very unlikely that we'll be able to beat DTS. However, we can get
closer by doing this:

1. Wrap updates in transactions. This will give you a noticeable perf boost.
You mention before that you insert in rounds of ~5K/10K rows. Right before
calling Update() on the adapter, call begin transaction on the connection
object that you're using with the adapter, and set the resulting
SqlTransaction instance to the update/insert/delete commands. Then commit
after the call to Update and call beingtransaction again before the next
call to repeat the cycle.

2. You can also bypass the dataset/adapter all together and use INSERT
statatements with SqlCommand.ExecuteNonQuery. In this case you can
concatenate many insert statements together and wrap each batch in a
transaction. This will give another perf boost because you'll reduce the
round-trips in several orders of magnitude.


FWIW, in the next version of ADO.NET we'll have bulk-insert APIs, so you can
expect huge perf improvements in this area :)

--
Pablo Castro
Program Manager - ADO.NET Team
Microsoft Corp.

This posting is provided "AS IS" with no warranties, and confers no rights.
 
1. Wrap updates in transactions. This will give you a noticeable perf
boost. You mention before that you insert in rounds of ~5K/10K rows.
Right before calling Update() on the adapter, call begin transaction on
the connection object that you're using with the adapter, and set the
resulting SqlTransaction instance to the update/insert/delete commands.
Then commit after the call to Update and call beingtransaction again
before the next call to repeat the cycle.

I already imported anything I had to, but I'll remember the tip for next
time :-)
Why is it so much faster when using transactions?
FWIW, in the next version of ADO.NET we'll have bulk-insert APIs, so you
can expect huge perf improvements in this area :)

That's indeed good :-)

Massimo
 
Here is a stored proc I wrote that you can use royalty free ;>)

if exists (select * from dbo.sysobjects where id = object_id(N'[dbo].[BulkInsertFileByTable]') and OBJECTPROPERTY(id, N'IsProcedure') = 1)
drop procedure [dbo].[BulkInsertFileByTable]
GO

SET QUOTED_IDENTIFIER ON
GO
SET ANSI_NULLS ON
GO

CREATE PROCEDURE dbo.BulkInsertFileByTable
(
@TableName varchar(256),
@FilePath varchar(256)

)

AS

declare @Command varchar(512)

set @Command = 'BULK INSERT dbo.[' + @TableName + ']' + ' FROM ' + '''' + @FilePath + ''''


exec (@Command)







RETURN

GO
SET QUOTED_IDENTIFIER OFF
GO
SET ANSI_NULLS ON
GO
 
Here is a stored proc I wrote that you can use royalty free ;>)

What is it for?
It just executes a BULK INSERT, so what's the benefit of using the procedure
instead of simply doing it?

Massimo
 
You can then call it using ado.net without using inline sql statements and programatically do Bulk Insert


----- Massimo wrote: ----

Here is a stored proc I wrote that you can use royalty free ;>

What is it for
It just executes a BULK INSERT, so what's the benefit of using the procedur
instead of simply doing it

Massim
 
You can then call it using ado.net without using inline sql
statements and programatically do Bulk Inserts

That's ok, but when I said "programmatically" I was intending "insert
directly into the DB the data I'm generating in my application", not "write
them to some text file and then load it" :-)

Massimo
 
Why is it so much faster when using transactions?

in general, if you have to perform several small side-effecting operations
(such as inserts), it's better to wrap them in a transaction. the main
reason for this is that every time you complete a transaction (or an
individual statement outside a transaction) the transaction log needs to be
flushed to disk, along with control information in the log. if you wrap
several operations in a transaction, you flush more data but all together
and not so often; if you don't use transactions for this, you'll see that
SQL will be hitting the disks hard because it has to continuously flush the
log for each DML operation that you perform.

Note that for other operations that are not disk-write intensive (e.g. DML
on table variables) transactions won't help and may actually regress perf.
Also, the inserts will take a while, you may see some extra contention by
using transactions.

--
Pablo Castro
Program Manager - ADO.NET Team
Microsoft Corp.

This posting is provided "AS IS" with no warranties, and confers no rights.
 
You should use BCP in 'fast' mode
i.e. make sure the table has got no indexes. Drop them and put them back on again if necessary.
 
Back
Top