Bear with me, here is some background...
I need to process a lot of data that is being brought in from SQL Server. A
large chunk of it can be brought in with a single query (albeit one joining to
many tables), and each row represents an item to process. However, there is
also a significant amount of data that exists in a 0..n relationship to each
of those rows: five different sets of correlated data, each with its own 0..n
relationship. For a hypothetical example, consider a customer order database
where we need to process all customers. We can bring in one row of customer
data per item with the main query, but there are 0..n1 rows of customer order
data, 0..n2 rows of customer feedback information, 0..n3 rows of available
customer discounts, etc.
There is far too much data to bring in at once into one big DataSet and
process. Instead, I want to bring in the data as individual DataSets, each
representing a single item to process. That is, table 0 contains the one row
from the main query, and tables 1 through 5 contain their respective 0..n rows
that correspond to the row in table 0.
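To make that shape concrete, here is a rough sketch of one empty per-item
DataSet (the table names are purely illustrative):

    using System.Data;

    // Rough sketch of one per-item DataSet: table 0 holds the single
    // main-query row, tables 1..5 hold the 0..n correlated rows for that
    // same item. The table names are illustrative only.
    static DataSet CreateItemDataSet()
    {
        DataSet item = new DataSet("CustomerItem");
        item.Tables.Add("Customer");   // table 0: the one main-query row
        item.Tables.Add("Orders");     // table 1: 0..n1 rows
        item.Tables.Add("Feedback");   // table 2: 0..n2 rows
        item.Tables.Add("Discounts");  // table 3: 0..n3 rows
        item.Tables.Add("Extra4");     // table 4: 0..n4 rows (placeholder name)
        item.Tables.Add("Extra5");     // table 5: 0..n5 rows (placeholder name)
        return item;
    }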
So, what is the best way to get these individual DataSets built?
1. Establish a SqlDataReader on the first query and, for each row obtained,
separately query for the other five sets of data to get their 0..n rows.
2. Establish a SqlDataReader on each query, ensuring that the five correlated
queries are ordered by the same key as the first (this key is common across
all these sources, of course). Then, once a primary row has been obtained,
examine the primary key/foreign key values on the correlated readers and merge
the appropriate rows into a DataSet (see the sketch after this list).
3. Fully denormalize in a gi-normous SQL query and go back to just a single
row of data per item.
4. ???
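To make (2) concrete, here is a simplified sketch of what I'm doing now, cut
down to the main query and just one of the correlated queries (the connection
handling, SQL text, table/column names, and the ProcessItem call are all
illustrative; the real code merges all five correlated readers the same way):

    using System;
    using System.Data;
    using System.Data.SqlClient;

    // Simplified sketch of approach (2): one reader per query, all ordered by
    // CustomerID, merged client-side into a per-item DataSet.
    static void ProcessItems(string connectionString)
    {
        using (SqlConnection mainConn = new SqlConnection(connectionString))
        using (SqlConnection orderConn = new SqlConnection(connectionString))
        {
            mainConn.Open();
            orderConn.Open();

            SqlCommand mainCmd = new SqlCommand(
                "SELECT CustomerID, Name FROM Customers ORDER BY CustomerID",
                mainConn);
            SqlCommand orderCmd = new SqlCommand(
                "SELECT CustomerID, OrderID, Total FROM Orders ORDER BY CustomerID",
                orderConn);

            using (SqlDataReader mainReader = mainCmd.ExecuteReader())
            using (SqlDataReader orderReader = orderCmd.ExecuteReader())
            {
                bool orderHasRow = orderReader.Read();

                while (mainReader.Read())
                {
                    int customerId = mainReader.GetInt32(0);

                    // Build the per-item DataSet for this customer.
                    DataSet item = new DataSet("CustomerItem");
                    DataTable customer = item.Tables.Add("Customer");
                    customer.Columns.Add("CustomerID", typeof(int));
                    customer.Columns.Add("Name", typeof(string));
                    customer.Rows.Add(customerId, mainReader.GetString(1));

                    DataTable orders = item.Tables.Add("Orders");
                    orders.Columns.Add("OrderID", typeof(int));
                    orders.Columns.Add("Total", typeof(decimal));

                    // Advance the correlated reader while its key matches the
                    // current main-row key. Because both streams are ordered
                    // by CustomerID (and every key in the correlated stream
                    // also appears in the main stream), the readers stay in
                    // lock step.
                    while (orderHasRow && orderReader.GetInt32(0) == customerId)
                    {
                        orders.Rows.Add(orderReader.GetInt32(1),
                                        orderReader.GetDecimal(2));
                        orderHasRow = orderReader.Read();
                    }

                    // ...repeat the same merge for the other four correlated
                    // readers...

                    ProcessItem(item);
                }
            }
        }
    }

    static void ProcessItem(DataSet item)
    {
        // The actual per-item processing logic lives here.
    }

The merge only works because every correlated result set is ordered by the
same key as the main query, so no correlated row ever has to be buffered for
a later item.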
I've only just started thinking about (3), but I would like to avoid it since
the n values of some of the 0..n subqueries could be significant (tens of
correlated rows in some cases, enough to make denormalization unpalatable).
I've not tried (1), thinking that querying so many times would kill
performance (there are potentially millions of items to process), and have
been going with (2). However, this is still proving to be a bottleneck in
processing. I'm hoping there is another way to go that avoids issuing separate
subqueries for each row returned by the primary query but moves more of the
burden of merging the correlated data onto the SQL server instead of my
process.
Any and all feedback and suggestions will be most welcome!
-- Tom