Best way to process millions of records


Peter Sedman

I have a MS SQL Server database table that contains 5 million records. The
primary key of the table is a GUID.

I need to write an application that reads each of these records and passes
some values to another process.

What's the best way to retrieve the records to be processed?

Reading the records into a DataTable would probably result in far too much
memory usage.

Does using a DataReader pass the load (and hence memory problem) onto SQL
Server instead?

Reading batches of records would be difficult as the primary key is a GUID
and is not sequential.

What about reading the primary key values into a DataTable and then reading
batches of records?

Has anyone had to solve a similar problem?

Thanks,
Peter
 
Peter Sedman said:
I have a MS SQL Server database table that contains 5 million records. The
primary key of the table is a GUID.

I need to write an application that reads each of these records and passes
some values to another process.

What's the best way to retrieve the records to be processed?

I would just use a DataReader and trust SQL Server and ADO.NET to implement
reasonable buffering.

On the other hand, if you have control of the code that sends to the other
process, you may want to batch up what you send to that process, or at least
implement a buffering scheme.
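
For illustration, here is a minimal sketch of that approach. The Records table, its Id/Payload columns, and the SendToOtherProcess sink are invented names, not anything from Peter's actual schema:

using System;
using System.Collections.Generic;
using System.Data.SqlClient;

class ReaderPump
{
    static void Main()
    {
        string connStr = "Server=.;Database=MyDb;Integrated Security=SSPI;";
        using (SqlConnection conn = new SqlConnection(connStr))
        {
            conn.Open();
            SqlCommand cmd = new SqlCommand(
                "SELECT Id, Payload FROM Records", conn);
            // The reader streams rows forward-only; only the current
            // network buffer sits in client memory at any one time.
            using (SqlDataReader reader = cmd.ExecuteReader())
            {
                List<string> batch = new List<string>(1000);
                while (reader.Read())
                {
                    batch.Add(reader.GetString(1));
                    if (batch.Count == 1000)
                    {
                        SendToOtherProcess(batch);
                        batch.Clear();
                    }
                }
                if (batch.Count > 0)
                    SendToOtherProcess(batch);
            }
        }
    }

    // Stand-in for whatever IPC the real target process uses.
    static void SendToOtherProcess(List<string> batch)
    {
        Console.WriteLine("sent {0} rows", batch.Count);
    }
}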

John Saunders
 
I don't see why the "non-sequential" matters.
If you need to read in some kind of order, do an "ORDER BY" in SQL Server
and then use a DataReader.
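
To make that concrete, here is a rough keyset-paging sketch (table and column names are made up): GUIDs are not sequential, but they still sort deterministically within SQL Server, so "WHERE Id > @last ORDER BY Id" walks the whole table in stable batches:

using System;
using System.Data.SqlClient;

static class GuidPager
{
    public static void ProcessInBatches(string connStr)
    {
        // Guid.Empty sorts below every real uniqueidentifier value,
        // so it is a safe starting point (an actual all-zero key
        // would be skipped by the > comparison).
        Guid last = Guid.Empty;
        int count;
        do
        {
            count = 0;
            using (SqlConnection conn = new SqlConnection(connStr))
            {
                conn.Open();
                SqlCommand cmd = new SqlCommand(
                    "SELECT TOP 10000 Id FROM Records " +
                    "WHERE Id > @last ORDER BY Id", conn);
                cmd.Parameters.AddWithValue("@last", last);
                using (SqlDataReader r = cmd.ExecuteReader())
                {
                    while (r.Read())
                    {
                        last = r.GetGuid(0);
                        count++;
                        // process or forward the row here
                    }
                }
            }
        } while (count > 0);
    }
}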
 
It depends somewhat on what the other process is. In general, the DataReader is definitely preferable to the DataTable in this case, since you don't appear to need to serialize the data or keep its state after you pass it off. You will save significant resources with that much data.

However, if you are processing 5 million records, I'd look to DTS if possible; it is much better suited to many tasks involving that much data. Again, though, it depends on the nature of the processes you are passing the data to.

5 million records is a lot to work with in any environment if you're processing them all in one go. You may want to look at breaking the records down into smaller, more manageable sets, for many reasons: primarily, if something goes wrong, you can notify and respond a lot quicker when the work is broken into smaller sets.

HTH,

Bill
 
I would take a look at server-side processing in this case--5 million rows
is a lot. Asking the server to sort them first can be an issue--especially
if there is not already an index in place. Once SQL CLR is here, it will be
easy (easier) to code a sophisticated routine to process the rows. At this
point, I would try to do everything I needed to do in a SP. Once the SP has
done its work, it can write the rows to a temporary table (hopefully a
subset of the original) and THEN you can use BCP/DTS to move it somewhere.
Asking ADO.NET (even with a data reader) to move the rows to the client is a
waste of time and resources (IMHO).
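
As a sketch of that flow (the procedure, work table, and file names here are invented for the example): do the heavy lifting in a stored procedure that fills a work table, then let bcp do the data pumping:

using System.Data;
using System.Data.SqlClient;
using System.Diagnostics;

static class ServerSideExport
{
    public static void Run(string connStr)
    {
        using (SqlConnection conn = new SqlConnection(connStr))
        {
            conn.Open();
            SqlCommand cmd = new SqlCommand("dbo.BuildWorkTable", conn);
            cmd.CommandType = CommandType.StoredProcedure;
            cmd.CommandTimeout = 0; // allow long-running server-side work
            cmd.ExecuteNonQuery();  // the proc writes its subset to dbo.WorkTable
        }

        // Export the (hopefully much smaller) work table with bcp:
        // -n = native format, -T = trusted connection, -S = server.
        Process p = Process.Start("bcp",
            "MyDb.dbo.WorkTable out worktable.dat -n -T -S .");
        p.WaitForExit();
    }
}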

--
____________________________________
William (Bill) Vaughn
Author, Mentor, Consultant
Microsoft MVP
www.betav.com
Please reply only to the newsgroup so that others can benefit.
This posting is provided "AS IS" with no warranties, and confers no rights.
__________________________________
 
Peter said:
> I have a MS SQL Server database table that contains 5 million records.
> The primary key of the table is a GUID.
>
> I need to write an application that reads each of these records and
> passes some values to another process.

So your SELECT should only return those columns.

> What's the best way to retrieve the records to be processed?

What's the process doing with the values? In a case like this, with a lot of rows, server-side processing is often the only way to get decent performance. Unless, of course, the other process is not under your control and simply needs to be fed all 5 million rows.

> Reading the records into a DataTable would probably result in far too
> much memory usage.

Yes. Beyond 57,000 rows or so, the DataTable is dead in the water.

> Does using a DataReader pass the load (and hence memory problem) onto
> SQL Server instead?

If you do a SELECT * FROM table and the table contains 5 million rows, SQL Server has to store the temporary resultset somewhere. This is often done in memory if the resultset is small, but with a large resultset it might use tempdb to store the cursor (it depends on the size of the resultset; if it is 30MB, for example, the server might still keep it entirely in memory).

A DataReader is in effect a server-side cursor. The SqlClient receives batches of data from the server, and these batches are not that large, so this could work in your situation.

> Reading batches of records would be difficult as the primary key is a
> GUID and is not sequential.
>
> What about reading the primary key values into a DataTable and then
> reading batches of records?

No can do; way too many rows.

Even though I'm generally against the usage of stored procedures in many situations, in your particular case I think it suits the job. In short, the best way to do this is to write a proc which does the processing that the other process does for you, which saves you from the data-pumping you would otherwise have to do.

Frans.

 
Peter, here are my 2 cents.

In order of increasing preference, I would use:

1. Dataset/DataTable <---- totally stay away from this in this case.
2. Datareader
3. Cross linked queries
4. DTS
5. BCP <--- my recommendation

So essentially my topmost recommendation is BCP, but I want to add one more thing: a GUID is a pig compared to an Int32. See if you can do away with the GUID on a table with millions of rows.

BTW, for comparison, I wouldn't be surprised if BCP were about 100 to 1,000 times faster than a DataReader.

If the other process is something OTHER than another DB, then you should look into DTS. ... What is the other process that expects these million-ish rows?
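
For what that might look like (object and file names invented): bcp's queryout mode exports the result of an arbitrary query straight to a file, bypassing ADO.NET entirely:

using System.Diagnostics;

static class BcpQueryExport
{
    public static void Export()
    {
        // -c = character format, -T = trusted connection, -S = server.
        Process p = Process.Start("bcp",
            "\"SELECT Id, Payload FROM MyDb.dbo.Records\" queryout " +
            "records.dat -c -T -S .");
        p.WaitForExit();
    }
}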

- Sahil Malik
http://dotnetjunkies.com/weblog/sahilmalik
 
Sahil,

A dual-processor box by itself does not say anything; they have existed for Windows systems since the Pentium 1 era.

Therefore they existed even with 286/386/486 processors, although as far as I remember they did nothing in Windows systems; they exist, of course, in non-Windows OSs with all types of processors.

Therefore what Bill says is the same as saying "I have a lot of PKs", which can really mean two horses in front of the cart.

Cor
 
Hi Cor,

I am not sure that I entirely understand your post; however, Windows NT and newer have support for dual-processor systems, and they take advantage of it. And I really doubt that the 386 and older had SMP capability. :-)
 
Miha,

I don't know anymore whether NT worked with a dual 486; however, I am sure it did with the Pentium 1, so I did not want to start too low.

:-)

Cor
 
Miha,

I used the wrong words: "Therefore" should be "Before that". I automatically used the Dutch words "daar voor".

Sorry, and thanks for pointing that out.

Cor
 