Iteration of Large Result Set

Thread starter: Kenny

Summary:

Large Table + DataReader Iteration = Slow
Large Table + Cursor Iteration = Fast

Details:

I am working on a Windows app that connects to an MS SQL Server db.

It is not designed for user interaction of any kind … it just has a lot of
reading to do (no updates) and wants to get it done quickly.

I’d like to be able to use a solution that’d work all the way back to
SQL Server 2000, but if the solution only lends itself to a newer version then
I can deal with that too.

I have millions of records that I import and insert via flat files and
SqlBulkInsert without any performance complaints.
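
For reference, the bulk-load side (assuming "SqlBulkInsert" refers to the
SqlBulkCopy class in ADO.NET; the connection string and table name below are
made up) would look roughly like this:

// Rough sketch only -- assumes the flat file has already been parsed into a DataTable
// (connection string and table name are hypothetical).
using System;
using System.Data;
using System.Data.SqlClient;

static void BulkLoad(DataTable rows)
{
    using (var conn = new SqlConnection("Data Source=.;Initial Catalog=ImportDb;Integrated Security=SSPI"))
    {
        conn.Open();
        using (var bulk = new SqlBulkCopy(conn))
        {
            bulk.DestinationTableName = "dbo.Transactions";   // hypothetical table name
            bulk.BatchSize = 10000;                           // commit in batches
            bulk.BulkCopyTimeout = 0;                         // no timeout for large loads
            bulk.WriteToServer(rows);
        }
    }
}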

However … trying to work with the imported data inside SQL Server is proving
to be very challenging.

For purposes of example we can say that the table has 30 million-ish rows
and a structure as follows:

Account
Product
Date
Time
Status
Description
Amount

I have a non-clustered index built on Account, Product, Date, Time, Status.
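
For illustration, that index (with the made-up table and index names used in
the sketches below) amounts to something like:

// Sketch only -- hypothetical table and index names; run once after the bulk load.
using System.Data.SqlClient;

static void BuildIndex(string connectionString)
{
    const string Sql =
        "CREATE NONCLUSTERED INDEX IX_Transactions_AcctProdDateTimeStatus " +
        "ON dbo.Transactions (Account, Product, [Date], [Time], Status)";

    using (var conn = new SqlConnection(connectionString))
    using (var cmd = new SqlCommand(Sql, conn))
    {
        cmd.CommandTimeout = 0;   // the build over ~30 million rows takes a while
        conn.Open();
        cmd.ExecuteNonQuery();
    }
}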

Let me mention two things right now …

1. This isn’t the exact structure so I can’t include all the fields I need
in a covering index … that is why I left out Description and Amount … to
visually show that I gotta dip back into the db when I have a "valid" record
to process.

2. I don’t build a clustered index because that takes too much time to begin
with.

I need the data ordered as: Account,Product,Date,Time and need to skip some
records whose Status = ‘BAD’ let’s say.

I have to stay away from a DataTable cause any single Account could have
millions of records. However it is often the case that they only have a few.
So … DataReader it is.

I have no problem with the forward-only business logic needed to employ the
DataReader and have tried three different approaches.

1. With the Account list known and, let’s say, already stored in memory in a
List<> of some kind … the most straightforward approach is to loop over the
list-o-Accounts and fire off a SqlDataReader with a select statement where
Account=@Account and Status<>'BAD' and ordered by Account,Product,Date,Time
(see the sketch after this list). This often results in a slight delay between
each Account. Well, if this were a user-driven type app a delay of 500 millisecs
between accounts would be fine. But if there are 100,000 accounts, a delay of
500 millisecs is killer to this "hands free" app.

2. So I figured that maybe I could just get the entire thing as a
SqlDataReader in one swipe. I’d do the control breaking on Account in code.
Well that doesn’t work because MSSQL server seems to want to buffer the
entire query before returning a single row. That sorta defeats the purpose
aye? I looked and looked for something that would set a fetchsize or buffer
size or cache size but couldn’t find it.

3. So then I figured I’d write some code to perform multiple smaller readers
that spanned groups of Accounts. So instead of, let’s say, 100,000 Accounts
resulting in just as many SqlDataReader calls, I now have maybe 5,000 calls
that maybe look like AccountNumber>'AA' and AccountNumber<='AB' or something
similar. Then I’d still control break manually over the returned reader
values. This too is much too slow.
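
For illustration, approach 1 boils down to something like this sketch (table
and column names are made up); the approach 3 variant just swaps the WHERE
clause for an account range:

// Sketch of approach 1 -- one reader per account (hypothetical table/column names).
// The range-per-reader variant in approach 3 changes the WHERE clause to
// "Account > @From AND Account <= @To".
using System.Collections.Generic;
using System.Data.SqlClient;

static void ProcessByAccount(string connectionString, IEnumerable<string> accounts)
{
    const string Sql =
        "SELECT Account, Product, [Date], [Time], Status, Description, Amount " +
        "FROM dbo.Transactions WITH (NOLOCK) " +
        "WHERE Account = @Account AND Status <> 'BAD' " +
        "ORDER BY Account, Product, [Date], [Time]";

    using (var conn = new SqlConnection(connectionString))
    {
        conn.Open();
        foreach (string account in accounts)
        {
            using (var cmd = new SqlCommand(Sql, conn))
            {
                cmd.Parameters.AddWithValue("@Account", account);
                using (var reader = cmd.ExecuteReader())
                {
                    while (reader.Read())
                    {
                        // forward-only business logic goes here
                    }
                }
            }
        }
    }
}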

Number 1 ran for hours until I stopped it.

Number 2 would never get past the ExecuteReader() until I stopped it after
like an hour or so. I tried WITH (NOLOCK) to see if that’d help. Maybe there’s
something different?

Number 3 started running quickly enough but took approx 3 hours to finish.

So … just for fun I wrote a server side cursor that iterated over the entire
table in the exact same manner minus the manual control break logic.

It finished in 10 minutes!

That’d be great!
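
For reference, the cursor comparison described here amounts to something like
the following batch, sent from the client in one go (table, column, and data
type names are guesses; as it turns out later in the thread, the actual test
apparently read only the indexed columns):

// Very rough sketch of the server-side cursor comparison, sent as a single
// T-SQL batch. The loop just fetches and discards rows -- no control-break logic.
using System.Data.SqlClient;

static void RunCursorTest(string connectionString)
{
    const string CursorBatch = @"
        DECLARE @Account varchar(20), @Product varchar(20), @Date datetime,
                @Time datetime, @Status varchar(10), @Description varchar(100),
                @Amount money;

        DECLARE cur CURSOR FAST_FORWARD FOR
            SELECT Account, Product, [Date], [Time], Status, Description, Amount
            FROM dbo.Transactions WITH (NOLOCK)
            WHERE Status <> 'BAD'
            ORDER BY Account, Product, [Date], [Time];

        OPEN cur;
        FETCH NEXT FROM cur INTO @Account, @Product, @Date, @Time, @Status, @Description, @Amount;
        WHILE @@FETCH_STATUS = 0
        BEGIN
            -- per-row work would go here
            FETCH NEXT FROM cur INTO @Account, @Product, @Date, @Time, @Status, @Description, @Amount;
        END
        CLOSE cur;
        DEALLOCATE cur;";

    using (var conn = new SqlConnection(connectionString))
    using (var cmd = new SqlCommand(CursorBatch, conn))
    {
        cmd.CommandTimeout = 0;
        conn.Open();
        cmd.ExecuteNonQuery();
    }
}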

But I have too much backend logic to use a T-SQL cursor and no real desire
to try and put all the "future" logic into SQLCLR.

All I really want is for Number 2 to work.

I’d like to get a DataReader on the entire table. If the cursor test can
complete in 10 minutes then hopefully there is a way to get the DataReader to
start returning records quicker.

Why won’t MS SQL Server start "feeding" my DataReader the records right away,
or at least after it has buffered enough to make it happy?

Sorry for the long winded post but it was my first and I’m trying to cover
everything for anyone that might wish to offer some insight.

I know it’d be best to chunk it up where possible but I kinda tried that
using test Number 3.

I really do need to iterate over the entire table in an order matching an
already built index.

I don’t need to update anything … just simply read over the data.

Thanks!
 
Kenny,
"MSSQL server seems to want to buffer the entire query before returning a
single row."

If you have an "Order by" clause in your select statement, it is expected to
be the case.

WITH (NOLOCK) is a very good hint to use for pure select statements

You said millions of records. I believe you are spending too much time
moving data between the server and the client, and then a similar load when
updating the data. This data transfer can be avoided if you move your
backend code to where it belongs, but then you are the only one who knows
how much work is involved.

I have been there before, and using backend logic improved performance a
great deal (from 3 weeks to 3-7 days of processing time).

Regards,

Trevor Benedict
 
Thanks for taking the time to read and reply Trevor!

Also ... I think I goofed on the cursor compare.

I wasn't "forcing" the record lookup past the index.

So ... basically ... in the cursor test I was using a covered index by
mistake.

Sorry for the false information ... I thought I had a "smoking gun" before I
posted.

That made all the difference in the world.

It’s basically been running for over 30 minutes now, so maybe the cursor will
take forever too, which kinda makes me feel better in a strange sorta way.

You said:

If you have an "Order by" clause in your select statement, it is expected to
be the case.

But the "Order by" is the exact same as the index I’m using … and … I’ve
tried "forcing" that index by providing the INDEX(IX_Name) hint too. You’d
think it’d be happy with that.
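
The kind of hinted query text being described would look roughly like this
(the index name is made up; it plugs into the same ExecuteReader loop as the
earlier sketch):

// Sketch of the hinted command text only -- index and table names are hypothetical.
const string HintedSql =
    "SELECT Account, Product, [Date], [Time], Status, Description, Amount " +
    "FROM dbo.Transactions WITH (NOLOCK, INDEX(IX_Transactions_AcctProdDateTimeStatus)) " +
    "WHERE Status <> 'BAD' " +
    "ORDER BY Account, Product, [Date], [Time]";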

Since the cursor code seems slow now too … I’m thinking there may be no way to
do what I want to do.

I think the record lookup is what’s killing me.

I’m kind of afraid to move the code to the server because there’s a lot of
crazy logic to follow besides just the Read that I’m currently testing.

My gut tells me such code shouldn’t be "inside" SQL Server.

Also … since my logic was flawed on the cursor test … if indeed the actual
record lookup is the killer then it wouldn’t help me anyway and you prolly
wouldn’t have suggested it in the first place.

Either my test box is just way underpowered for what I'm trying to
accomplish or I’m going to have to rethink everything.
 
UPDATE:

I've played with the code I wrote to perform multiple smaller readers
that spanned groups of Accounts.

I'm now using a "real" Thread instead of a BackgroundWorker, and instead of
100,000 records per group ... I bumped it up to 2,000,000.

The task now completes in just under one hour.

While this is still only approx 8k records per second ... it is at least
kinda tolerable.

I'd imagine if I had a better test machine the results would be much better
as well.
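
Roughly, the chunked version being described looks something like this sketch
(names and the range plumbing are made up):

// Rough sketch of the chunked variant on a dedicated thread (names hypothetical).
// Each chunk covers a contiguous range of accounts so one SqlDataReader streams
// roughly 2,000,000 rows before the next command is issued.
using System.Collections.Generic;
using System.Data.SqlClient;
using System.Threading;

static Thread StartChunkedScan(string connectionString, IList<KeyValuePair<string, string>> accountRanges)
{
    var worker = new Thread(() =>
    {
        const string Sql =
            "SELECT Account, Product, [Date], [Time], Status, Description, Amount " +
            "FROM dbo.Transactions WITH (NOLOCK) " +
            "WHERE Account > @From AND Account <= @To AND Status <> 'BAD' " +
            "ORDER BY Account, Product, [Date], [Time]";

        using (var conn = new SqlConnection(connectionString))
        {
            conn.Open();
            foreach (var range in accountRanges)
            {
                using (var cmd = new SqlCommand(Sql, conn))
                {
                    cmd.CommandTimeout = 0;
                    cmd.Parameters.AddWithValue("@From", range.Key);
                    cmd.Parameters.AddWithValue("@To", range.Value);
                    using (var reader = cmd.ExecuteReader())
                    {
                        while (reader.Read())
                        {
                            // control-break on Account/Product here
                        }
                    }
                }
            }
        }
    });
    worker.IsBackground = false;   // a "real" thread doing the whole scan
    worker.Start();
    return worker;
}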
 
Ah, did you consider not bringing the rows to the client and performing the
operation on the server?

--
__________________________________________________________________________
William R. Vaughn
President and Founder Beta V Corporation
Author, Mentor, Dad, Grandpa
Microsoft MVP
(425) 556-9205 (Pacific time)
Hitchhiker’s Guide to Visual Studio and SQL Server (7th Edition)
http://betav.com/blog/billva
http://betav.com
____________________________________________________________________________________________
 
Kenny,

A little bit in addition to Bill (William):

Your question can have a simple answer: LINQ to SQL is exactly what you are
asking for.

It gives you back only the result set you have described in your LINQ
statement.

Cor
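
For what it's worth, a minimal LINQ to SQL query of the kind Cor seems to be
suggesting would look roughly like this (the data context and entity names
are made up):

// Minimal LINQ to SQL sketch (the data context and entity names are hypothetical).
// The query is deferred; it runs as one SELECT when the foreach starts enumerating.
using System.Linq;

static void ScanWithLinq(string connectionString)
{
    using (var db = new ImportDataContext(connectionString))   // hypothetical generated context
    {
        var rows = from t in db.Transactions
                   where t.Status != "BAD"
                   orderby t.Account, t.Product, t.Date, t.Time
                   select t;

        foreach (var row in rows)
        {
            // same forward-only processing as with the DataReader
        }
    }
}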
 
Server side code would prolly be the most efficient in that respect ... but
... I know what code needs to be added in the future and the logic is simply
too much to ask of the SQLCLR.

Tis not "rational" ...

I've actually gotta do some crazy things like generate/write PCL code to
individual files from potentially millions of SQL records.

Writing to external files and other crazy things that I've left out make
using SQLCLR almost seem like abuse of the Server! {;-)

 
Please excuse my possible ignorance but I do not believe LINQ will help me
with speed ... maybe speed of development but not speed "in code".

LINQ still needs to dip into the db and the major problem is the large IO
overhead and movement of data to the client.

I was trying to find ways to lower the "individual query cost" and know that
the best case for speed for the complete result set iteration has gotta be
the DataReader.

But ... the result set is too big for a single DataReader and SQL server
won't just start feeding me records because I use an ORDER BY in the select
statement ...

However I have an INDEX that exactly matches the ORDER BY.

That's disturbing to me!

Life would be grand if it'd just follow that index ...

I could just issue one DataReader and the server would instantly start
spitting data at me.

It works if I use a CLUSTERED index but that results in more worktime to
build than actually doing what I'm doing right now.

And there are too many records to keep in memory on the client app ... while
... bouncing back and forth to the database for every customer with a
DataReader simply takes too long.

On customers with millions of recs it is not too bad, but some only have a
few records, and this often causes a long pause between some select queries as
SQL Server tries to find and load each one.

I figured CHUNKING large files for transmission over IP or read/write to
disk is always better than streaming single bytes ... so ... I figured
CHUNKING the large result set into many DataReaders would help too.

And it does ... but ... returning larger and larger result sets via the
DataReader improves performance more and more ... I'm sure only up to a point
cause sooner or later you max out the server itself.
 