Kenny
Summary:
Large Table + DataReader Iteration = Slow
Large Table + Cursor Iteration = Fast
Details:
I am working on a Windows app that connects to a MS SQL Server db.
It is not designed for user interaction of any kind … it just has a lot of
reading to do (no updates) and needs to get it done quickly.
I’d like to use a solution that’d work all the way back to
SQL Server 2000, but if the solution only lends itself to a newer version then
I can deal with that too.
I have millions of records that I import from flat files and insert via
SqlBulkCopy without any complaint about performance.
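For context, the import side looks roughly like this (a simplified sketch; the
file parsing is omitted and the table name is just a placeholder):

using System.Data.SqlClient;

static void BulkLoad(string connectionString, System.Data.IDataReader sourceReader)
{
    using (SqlConnection conn = new SqlConnection(connectionString))
    {
        conn.Open();
        using (SqlBulkCopy bulk = new SqlBulkCopy(conn))
        {
            bulk.DestinationTableName = "dbo.BigTable"; // placeholder name
            bulk.BatchSize = 10000;                     // tuned by trial
            bulk.WriteToServer(sourceReader);           // reader over the flat file
        }
    }
}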
However … trying to work with the imported data inside SQL Server is proving
to be very challenging.
For purposes of example we can say that the table has roughly 30 million rows
and a structure as follows:
Account
Product
Date
Time
Status
Description
Amount
I have a non-clustered index built on Account, Product, Date, Time, Status
(rough DDL below).
Let me mention two things right now …
1. This isn’t the exact structure so I can’t include all the fields I need
in a covering index … that is why I left out Description and Amount … to
visually show that I have to dip back into the db when I have a “valid” record
to process.
2. I don’t build a clustered index because that takes too much time to begin
with.
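In DDL terms, the example index is roughly this (the table name is just a
placeholder):

CREATE NONCLUSTERED INDEX IX_BigTable_AcctProdDateTime
    ON dbo.BigTable (Account, Product, [Date], [Time], Status);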
I need the data ordered as Account, Product, Date, Time and need to skip some
records whose Status = ‘BAD’, let’s say.
I have to stay away from a DataTable because any single Account could have
millions of records. However, it is often the case that they only have a few.
So … DataReader it is.
I have no problem with the forward-only business logic needed to employ the
DataReader and have tried three different approaches.
1. With the Account list known and, let’s say, already stored in memory in a
List<> of some kind … the most straightforward approach is to loop over the
list of Accounts and fire off a SqlDataReader with a SELECT statement where
Account = @Account and Status <> ‘BAD’, ordered by Account, Product, Date, Time.
This often results in a slight delay between each Account. Well, if this were
a user-driven app, a delay of 500 milliseconds between accounts would be
fine. But if there are 100,000 accounts, a delay of 500 milliseconds is killer
to this “hands-free” app.
2. So I figured that maybe I could just get the entire thing as a
SqlDataReader in one pass. I’d do the control breaking on Account in code.
Well, that doesn’t work, because SQL Server seems to want to buffer the
entire query before returning a single row. That sort of defeats the purpose,
doesn’t it? I looked and looked for something that would set a fetch size,
buffer size, or cache size, but couldn’t find it.
3. So then I figured I’d write some code to perform multiple smaller readers
that each spanned a group of Accounts. So instead of, let’s say, 100,000
Accounts resulting in just as many SqlDataReader calls, I now have maybe 5,000
calls that look something like AccountNumber > ‘AA’ and AccountNumber <= ‘AB’
or similar. Then I’d still control break manually over the returned reader
values. This too is much too slow. (All three approaches are sketched in code
below.)
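Roughly, the attempts look like this in code (a simplified sketch; the table
and column names are the placeholders from my example, and the per-row
processing is omitted):

using System.Collections.Generic;
using System.Data.SqlClient;

class ReaderTests
{
    // Approach 1: one SqlDataReader per Account.
    static void PerAccount(string connStr, List<string> accounts)
    {
        using (SqlConnection conn = new SqlConnection(connStr))
        {
            conn.Open();
            foreach (string account in accounts)
            {
                using (SqlCommand cmd = new SqlCommand(
                    "SELECT Account, Product, [Date], [Time], Status, Description, Amount " +
                    "FROM dbo.BigTable " +
                    "WHERE Account = @Account AND Status <> 'BAD' " +
                    "ORDER BY Account, Product, [Date], [Time]", conn))
                {
                    cmd.Parameters.AddWithValue("@Account", account);
                    using (SqlDataReader rdr = cmd.ExecuteReader())
                    {
                        while (rdr.Read()) { /* process row */ }
                    }
                }
            }
        }
    }

    // Approach 2: one reader over the whole table, control-breaking on Account in code.
    // Approach 3 is the same shape, just run ~5,000 times over range slices like
    //   WHERE AccountNumber > @From AND AccountNumber <= @To AND Status <> 'BAD'.
    static void OneBigReader(string connStr)
    {
        using (SqlConnection conn = new SqlConnection(connStr))
        {
            conn.Open();
            using (SqlCommand cmd = new SqlCommand(
                "SELECT Account, Product, [Date], [Time], Status, Description, Amount " +
                "FROM dbo.BigTable " +
                "WHERE Status <> 'BAD' " +
                "ORDER BY Account, Product, [Date], [Time]", conn))
            {
                cmd.CommandTimeout = 0; // no timeout on the long-running query
                using (SqlDataReader rdr = cmd.ExecuteReader()) // this is where it stalls
                {
                    string lastAccount = null;
                    while (rdr.Read())
                    {
                        string account = rdr.GetString(0);
                        if (account != lastAccount)
                        {
                            // control break: a new Account starts here
                            lastAccount = account;
                        }
                        /* process row */
                    }
                }
            }
        }
    }
}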
Number 1 ran for hours until I stopped it.
Number 2 would never get past the ExecuteReader() call until I stopped it after
an hour or so. I tried WITH (NOLOCK) to see if that’d help. Maybe there’s
something different I could try?
Number 3 started running quickly enough but took approximately 3 hours to finish.
So … just for fun, I wrote a server-side cursor that iterated over the entire
table in exactly the same manner, minus the manual control-break logic.
It finished in 10 minutes!
That’d be great!
But I have too much backend logic to use a T-SQL cursor, and no real desire
to try to put all the “future” logic into SQLCLR.
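For reference, the cursor test was shaped roughly like this (a simplified
sketch; the variable types and names are placeholders):

DECLARE @Account varchar(32), @Product varchar(32);

DECLARE curAll CURSOR LOCAL FAST_FORWARD FOR
    SELECT Account, Product
    FROM dbo.BigTable
    WHERE Status <> 'BAD'
    ORDER BY Account, Product, [Date], [Time];

OPEN curAll;
FETCH NEXT FROM curAll INTO @Account, @Product;
WHILE @@FETCH_STATUS = 0
BEGIN
    -- the row-by-row work went here in the test
    FETCH NEXT FROM curAll INTO @Account, @Product;
END
CLOSE curAll;
DEALLOCATE curAll;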
All I really want is for Number 2 to work.
I’d like to get a DataReader on the entire table. If the cursor test can
complete in 10 minutes then hopefully there is a way to get the DataReader to
start returning records quicker.
Why won’t SQL Server start “feeding” my DataReader the records right away,
or at least after it has buffered enough to make it happy?
Sorry for the long-winded post, but it was my first and I’m trying to cover
everything for anyone who might wish to offer some insight.
I know it’d be best to chunk it up where possible but I kinda tried that
using test Number 3.
I really do need to iterate over the entire table in an order matching an
already built index.
I don’t need to update anything … just simply read over the data.
Thanks!