Kenny
Summary:
Large Table + DataReader Iteration = Slow
Large Table + Cursor Iteration = Fast
Details:
I am working on a Windows app that connects to a MS SQL Server db.
It is not designed for user interaction of any kind … it just has a lot of
reading to do (no updates) and needs to get it done quickly.
I’d like to use a solution that’d work all the way back to
SQL Server 2000, but if the solution only lends itself to a newer version then
I can deal with that too.
I have millions of records that I import from flat files and insert via
SqlBulkCopy without any complaint about performance.
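For context, the import side looks roughly like this (a simplified sketch; the
file parsing is omitted and the table name is just a placeholder):

using System.Data.SqlClient;

static void BulkLoad(string connectionString, System.Data.IDataReader sourceReader)
{
    using (SqlConnection conn = new SqlConnection(connectionString))
    {
        conn.Open();
        using (SqlBulkCopy bulk = new SqlBulkCopy(conn))
        {
            bulk.DestinationTableName = "dbo.BigTable"; // placeholder name
            bulk.BatchSize = 10000;                     // tuned by trial
            bulk.WriteToServer(sourceReader);           // reader over the flat file
        }
    }
}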
However … trying to work with the imported data inside SQL Server is proving
to be very challenging.
For purposes of example we can say that the table has roughly 30 million rows
and a structure as follows:
Account
Product
Date
Time
Status
Description
Amount
I have a non-clustered index built on Account, Product, Date, Time, Status
(rough DDL below).
Let me mention two things right now …
1. This isn’t the exact structure so I can’t include all the fields I need
in a covering index … that is why I left out Description and Amount … to
visually show that I have to dip back into the db when I have a “valid” record
to process.
2. I don’t build a clustered index because that takes too much time to begin
with.
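In DDL terms, the example index is roughly this (the table name is just a
placeholder):

CREATE NONCLUSTERED INDEX IX_BigTable_AcctProdDateTime
    ON dbo.BigTable (Account, Product, [Date], [Time], Status);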
I need the data ordered as Account, Product, Date, Time and need to skip some
records whose Status = ‘BAD’, let’s say.
I have to stay away from a DataTable because any single Account could have
millions of records. However, it is often the case that they only have a few.
So … DataReader it is.
I have no problem with the forward-only business logic needed to employ the
DataReader and have tried three different approaches.
1. With the Account list known and, let’s say, already stored in memory in a
List<> of some kind … the most straightforward approach is to loop over the
list of Accounts and fire off a SqlDataReader with a SELECT statement where
Account = @Account and Status <> ‘BAD’, ordered by Account, Product, Date, Time.
This often results in a slight delay between each Account. Well, if this were
a user-driven app, a delay of 500 milliseconds between accounts would be
fine. But if there are 100,000 accounts, a delay of 500 milliseconds is killer
to this “hands-free” app.
2. So I figured that maybe I could just get the entire thing as a
SqlDataReader in one pass. I’d do the control breaking on Account in code.
Well, that doesn’t work, because SQL Server seems to want to buffer the
entire query before returning a single row. That sort of defeats the purpose,
doesn’t it? I looked and looked for something that would set a fetch size,
buffer size, or cache size, but couldn’t find it.
3. So then I figured I’d write some code to perform multiple smaller readers
that each spanned a group of Accounts. So instead of, let’s say, 100,000
Accounts resulting in just as many SqlDataReader calls, I now have maybe 5,000
calls that look something like AccountNumber > ‘AA’ and AccountNumber <= ‘AB’
or similar. Then I’d still control break manually over the returned reader
values. This too is much too slow. (All three approaches are sketched in code
below.)
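Roughly, the attempts look like this in code (a simplified sketch; the table
and column names are the placeholders from my example, and the per-row
processing is omitted):

using System.Collections.Generic;
using System.Data.SqlClient;

class ReaderTests
{
    // Approach 1: one SqlDataReader per Account.
    static void PerAccount(string connStr, List<string> accounts)
    {
        using (SqlConnection conn = new SqlConnection(connStr))
        {
            conn.Open();
            foreach (string account in accounts)
            {
                using (SqlCommand cmd = new SqlCommand(
                    "SELECT Account, Product, [Date], [Time], Status, Description, Amount " +
                    "FROM dbo.BigTable " +
                    "WHERE Account = @Account AND Status <> 'BAD' " +
                    "ORDER BY Account, Product, [Date], [Time]", conn))
                {
                    cmd.Parameters.AddWithValue("@Account", account);
                    using (SqlDataReader rdr = cmd.ExecuteReader())
                    {
                        while (rdr.Read()) { /* process row */ }
                    }
                }
            }
        }
    }

    // Approach 2: one reader over the whole table, control-breaking on Account in code.
    // Approach 3 is the same shape, just run ~5,000 times over range slices like
    //   WHERE AccountNumber > @From AND AccountNumber <= @To AND Status <> 'BAD'.
    static void OneBigReader(string connStr)
    {
        using (SqlConnection conn = new SqlConnection(connStr))
        {
            conn.Open();
            using (SqlCommand cmd = new SqlCommand(
                "SELECT Account, Product, [Date], [Time], Status, Description, Amount " +
                "FROM dbo.BigTable " +
                "WHERE Status <> 'BAD' " +
                "ORDER BY Account, Product, [Date], [Time]", conn))
            {
                cmd.CommandTimeout = 0; // no timeout on the long-running query
                using (SqlDataReader rdr = cmd.ExecuteReader()) // this is where it stalls
                {
                    string lastAccount = null;
                    while (rdr.Read())
                    {
                        string account = rdr.GetString(0);
                        if (account != lastAccount)
                        {
                            // control break: a new Account starts here
                            lastAccount = account;
                        }
                        /* process row */
                    }
                }
            }
        }
    }
}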
Number 1 ran for hours until I stopped it.
Number 2 would never get past the ExecuteReader() call until I stopped it after
an hour or so. I tried WITH (NOLOCK) to see if that’d help. Maybe there’s
something different I could try?
Number 3 started running quickly enough but took approximately 3 hours to finish.
So … just for fun, I wrote a server-side cursor that iterated over the entire
table in exactly the same manner, minus the manual control-break logic.
It finished in 10 minutes!
That’d be great!
But I have too much backend logic to use a T-SQL cursor, and no real desire
to try to put all the “future” logic into SQLCLR.
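For reference, the cursor test was shaped roughly like this (a simplified
sketch; the variable types and names are placeholders):

DECLARE @Account varchar(32), @Product varchar(32);

DECLARE curAll CURSOR LOCAL FAST_FORWARD FOR
    SELECT Account, Product
    FROM dbo.BigTable
    WHERE Status <> 'BAD'
    ORDER BY Account, Product, [Date], [Time];

OPEN curAll;
FETCH NEXT FROM curAll INTO @Account, @Product;
WHILE @@FETCH_STATUS = 0
BEGIN
    -- the row-by-row work went here in the test
    FETCH NEXT FROM curAll INTO @Account, @Product;
END
CLOSE curAll;
DEALLOCATE curAll;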
All I really want is for Number 2 to work.
I’d like to get a DataReader on the entire table. If the cursor test can
complete in 10 minutes then hopefully there is a way to get the DataReader to
start returning records quicker.
Why won’t SQL Server start “feeding” my DataReader the records right away,
or at least after it has buffered enough to make it happy?
Sorry for the long-winded post, but it was my first and I’m trying to cover
everything for anyone who might wish to offer some insight.
I know it’d be best to chunk it up where possible but I kinda tried that
using test Number 3.
I really do need to iterate over the entire table in an order matching an
already built index.
I don’t need to update anything … just simply read over the data.
Thanks!