Efficient coordinated queries??

  • Thread starter: Tom

Tom

Bear with me, here is some background...
I need to process a lot of data that's being brought in from SQL. A large
chunk of it can be brought in with a single query (albeit joining to many
tables) and each row represents an item to process. However, there is a
significant amount of data that exists in a 0..n relationship to this
initial row of data, basically five different sets of data that have a 0..n
relationship. For a hypothetical example, consider a customer order database
where we need to process through all customers and can bring in a row of
customer data as our main query but there are 0..n1 rows of customer order
data, 0..n2 rows of customer feedback information, 0..n3 rows of customer
discounts available, etc.

There is way too much data to bring in at once into one big DataSet and
process. Instead I want to bring in the data as individual DataSets
representing an item to process. That is, table 0 contains the one row from
the main query and table 1 through table 5 have their respective 0..n rows
that correspond to the data in table 0.

So, what is the best way to get these individual DataSets built?
1. Establish a SqlDataReader on the first query and, for each row obtained,
separately query for the other five sets of data to get their 0..n rows.

2. Establish a SqlDataReader on each query ensuring that the five correlated
queries are ordered the same as the first for its primary key (this key is
common across all these sources, of course) and examine the primary
key/foreign key values for the correlated readers to merge the appropriate
rows into a DataSet once the primary row has been obtained.

3. Fully denormalize in a gi-normous SQL query and go back to just a single
row of data per item.

4. ???

I've only just started thinking about (3) but would like to avoid it, since
the n values of some of the 0..n subqueries can run into the tens or
twenties of correlated rows, which is enough to make denormalization
unpalatable. I've not tried (1), thinking that the act of querying so many
times would kill
performance (there are potentially millions of items to process) and have
been going with (2). However, this is still proving to be a bottleneck in
processing. I'm hoping there is another way to go that avoids having to
issue separate subqueries for each row returned on the primary query but
moves more of the burden of merging the correlated data onto the SQL server
instead of my process.
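
For reference, what I'm doing for (2) looks roughly like the sketch below.
The table and column names (Customers, Orders, CustomerID) are hypothetical,
only one of the five correlated readers is shown, each reader has its own
connection, and every query is ordered by the common key:

using System;
using System.Data;
using System.Data.SqlClient;

class CoordinatedReaders
{
    static void Run(string connStr)
    {
        // One connection per open reader (no MARS assumed).
        using (SqlConnection cnMain = new SqlConnection(connStr))
        using (SqlConnection cnOrders = new SqlConnection(connStr))
        {
            cnMain.Open();
            cnOrders.Open();

            // Both queries ordered by the common key so they can be merged.
            SqlCommand cmdMain = new SqlCommand(
                "SELECT CustomerID, Name FROM Customers ORDER BY CustomerID", cnMain);
            SqlCommand cmdOrders = new SqlCommand(
                "SELECT CustomerID, OrderID, Amount FROM Orders ORDER BY CustomerID", cnOrders);

            using (SqlDataReader rdrMain = cmdMain.ExecuteReader())
            using (SqlDataReader rdrOrders = cmdOrders.ExecuteReader())
            {
                bool haveOrder = rdrOrders.Read();   // read-ahead on the child reader

                while (rdrMain.Read())
                {
                    int custId = rdrMain.GetInt32(0);

                    // Table 0: the one primary row for this item.
                    DataSet item = new DataSet();
                    DataTable tCust = item.Tables.Add("Customer");
                    tCust.Columns.Add("CustomerID", typeof(int));
                    tCust.Columns.Add("Name", typeof(string));
                    tCust.Rows.Add(custId, rdrMain.GetString(1));

                    // Table 1: the 0..n correlated rows that share the key.
                    DataTable tOrders = item.Tables.Add("Orders");
                    tOrders.Columns.Add("OrderID", typeof(int));
                    tOrders.Columns.Add("Amount", typeof(decimal));

                    while (haveOrder && rdrOrders.GetInt32(0) == custId)
                    {
                        tOrders.Rows.Add(rdrOrders.GetInt32(1), rdrOrders.GetDecimal(2));
                        haveOrder = rdrOrders.Read();
                    }

                    ProcessItem(item);   // whatever the per-item processing is
                }
            }
        }
    }

    static void ProcessItem(DataSet item) { /* ... */ }
}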

Any and all feedback and suggestions will be most welcome!

-- Tom
 
Hi Tom,

I suppose yours is a client-server or n-tier scenario (IOW, it does not run
next to the server).
I'd create a proper DataSet with all the table structures required, but
without data.
I'd fetch the data required on demand using different DataAdapters, since a
DataSet allows adding rows.
Does that make sense to you?
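
Roughly something like this, with hypothetical names and only one child table
shown; FillSchema builds the empty table structures up front (a placeholder
parameter value lets it run before any real data is fetched) and each Fill
adds only the rows asked for:

using System.Data;
using System.Data.SqlClient;

class SchemaFirst
{
    static DataSet Build(string connStr)
    {
        DataSet ds = new DataSet();

        SqlDataAdapter daCustomer = new SqlDataAdapter(
            "SELECT CustomerID, Name FROM Customers WHERE CustomerID = @id",
            new SqlConnection(connStr));
        daCustomer.SelectCommand.Parameters.Add("@id", SqlDbType.Int).Value = 0;

        SqlDataAdapter daOrders = new SqlDataAdapter(
            "SELECT OrderID, CustomerID, Amount FROM Orders WHERE CustomerID = @id",
            new SqlConnection(connStr));
        daOrders.SelectCommand.Parameters.Add("@id", SqlDbType.Int).Value = 0;

        // Table structures only; no data yet.
        daCustomer.FillSchema(ds, SchemaType.Mapped, "Customer");
        daOrders.FillSchema(ds, SchemaType.Mapped, "Orders");

        // Later, on demand, for one key at a time:
        daCustomer.SelectCommand.Parameters["@id"].Value = 42;   // some key
        daOrders.SelectCommand.Parameters["@id"].Value = 42;
        daCustomer.Fill(ds, "Customer");   // Fill opens and closes the connection
        daOrders.Fill(ds, "Orders");

        return ds;
    }
}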
 

How often do the 1..n subsets change?
Do you always need all of them in their entirety? In other words, can
you filter some of the 1..n subsets based on some business rules (for
example, not all customer discounts are available if the customer had
a past-due payment in the last 6 months)?

Suppose you've constructed your DataSet(s) - what happens after that?
How are you using the main set and the 1..n sets?

***

What platform? Have you considered the MS shaping language? If that's an
option for you, I would be interested in hearing other opinions myself.
 
Mmm... yes, I think so. But isn't that just going to issue separate
sub-queries on each different data adapter? I.e., isn't this just my first
solution hidden within the implementation of ADO.NET DataAdapters?

But let me paraphrase and you can tell me if I'm understanding this
correctly...

1. I set up a DataSet with each table, full schema including primary keys and
the foreign key relationships between the tables.

2. I create a DataAdapter for each table where the SelectCommand on the
coordinated adapters is parameterized to take the foreign key corresponding
to the row from the first table.

3. I set up a DataReader on the first query and begin pulling one row at a
time, extracting the primary key value and setting it as the parameter value
for the coordinated adapters.

4. I perform a Fill on the coordinated adapters, bringing in the 0..n rows
that correspond to the initial data row.

If this is essentially what you've suggested, then I'm unsure about the
performance... Isn't the Fill operation going to be performing a separate
query for each primary item?
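
In code, what I have in mind for steps 1-4 is roughly this (hypothetical
names; only one of the five coordinated adapters is shown, and it uses its
own connection since the reader keeps the main one busy):

using System.Data;
using System.Data.SqlClient;

class PerItemFill
{
    static void Run(string connStr)
    {
        // Step 2: a coordinated adapter whose SelectCommand takes the foreign key.
        SqlDataAdapter daOrders = new SqlDataAdapter(
            "SELECT OrderID, Amount FROM Orders WHERE CustomerID = @id",
            new SqlConnection(connStr));   // its own connection
        daOrders.SelectCommand.Parameters.Add("@id", SqlDbType.Int);

        using (SqlConnection cnMain = new SqlConnection(connStr))
        {
            cnMain.Open();

            // Step 3: a reader on the first query, pulled one row at a time.
            SqlCommand cmdMain = new SqlCommand(
                "SELECT CustomerID, Name FROM Customers", cnMain);

            using (SqlDataReader rdr = cmdMain.ExecuteReader())
            {
                while (rdr.Read())
                {
                    int custId = rdr.GetInt32(0);

                    DataSet item = new DataSet();

                    // Step 4: one Fill per coordinated table, per item
                    // (one extra query per table for every primary row).
                    daOrders.SelectCommand.Parameters["@id"].Value = custId;
                    daOrders.Fill(item, "Orders");

                    // ... process the item DataSet here ...
                }
            }
        }
    }
}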

Regards,
-- Tom
 
Tom,

This is basic ADO.NET:
use one DataSet with multiple DataTables, and also set the DataRelations in
the DataSet.
Create one DataAdapter for every DataTable.
Use an adapter Fill for every DataTable.
Set the table mappings for every DataTable that is in the DataSet.
Don't forget to clear the DataSet before you start using the Fill.

I don't understand why you want to use a DataReader at all.

Cor
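
In code, that suggestion looks roughly like this (hypothetical names; the
default source table "Table" is mapped onto each DataTable name, the schema
and DataRelation are set up once, and the DataSet is cleared before each
Fill; the placeholder parameter values just let FillSchema run):

using System.Data;
using System.Data.SqlClient;

class RelatedTables
{
    static DataSet Build(string connStr)
    {
        DataSet ds = new DataSet();

        // One adapter per DataTable; the default source table "Table" is
        // mapped onto the DataTable name.
        SqlDataAdapter daCust = new SqlDataAdapter(
            "SELECT CustomerID, Name FROM Customers WHERE CustomerID = @id",
            new SqlConnection(connStr));
        daCust.SelectCommand.Parameters.Add("@id", SqlDbType.Int).Value = 0;
        daCust.TableMappings.Add("Table", "Customer");

        SqlDataAdapter daOrders = new SqlDataAdapter(
            "SELECT OrderID, CustomerID, Amount FROM Orders WHERE CustomerID = @id",
            new SqlConnection(connStr));
        daOrders.SelectCommand.Parameters.Add("@id", SqlDbType.Int).Value = 0;
        daOrders.TableMappings.Add("Table", "Orders");

        // Schema and the DataRelation are set up once.
        daCust.FillSchema(ds, SchemaType.Mapped);
        daOrders.FillSchema(ds, SchemaType.Mapped);
        ds.Relations.Add("Customer_Orders",
            ds.Tables["Customer"].Columns["CustomerID"],
            ds.Tables["Orders"].Columns["CustomerID"]);

        // Per item: clear the old rows, set the key, Fill each DataTable.
        ds.Clear();
        daCust.SelectCommand.Parameters["@id"].Value = 42;     // some key
        daOrders.SelectCommand.Parameters["@id"].Value = 42;
        daCust.Fill(ds);
        daOrders.Fill(ds);

        return ds;
    }
}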


 
> But let me paraphrase and you can tell me if I'm understanding this
> correctly...
>
> 1. I set up a DataSet with each table, full schema including primary keys
> and the foreign key relationships between the tables.

Yes.

> 2. I create a DataAdapter for each table where the SelectCommand on the
> coordinated adapters is parameterized to take the foreign key
> corresponding to the row from the first table.

The select command can be whatever you like, as long as it returns the data
required (you can use different WHERE conditions or even more than one
adapter per table).

> 3. I set up a DataReader on the first query and begin pulling one row at
> a time, extracting the primary key value and setting it as the parameter
> value for the coordinated adapters.

No need for a DataReader. Just call adapter.Fill.

> 4. I perform a Fill on the coordinated adapters, bringing in the 0..n rows
> that correspond to the initial data row.
>
> If this is essentially what you've suggested then I'm unsure of
> performance... Isn't the Fill operation going to be performing a separate
> query for each primary item?

It is up to you how you set up the adapters and their conditions. You might
load three primary rows, extract their primary keys, and use these keys in
the secondary data adapters' WHERE conditions.

Does that make sense to you?
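
For instance, the "three primary rows" idea might be sketched like this
(hypothetical names; the child adapter's WHERE clause takes a small batch of
keys, so one round trip covers several items):

using System.Data;
using System.Data.SqlClient;

class BatchedFill
{
    // customerIds is a small batch of primary keys already read from the main query.
    static DataSet FillBatch(string connStr, int[] customerIds)
    {
        DataSet batch = new DataSet();

        using (SqlConnection cn = new SqlConnection(connStr))
        {
            // Build a parameterized IN list: @k0, @k1, ...
            string[] names = new string[customerIds.Length];
            SqlCommand cmd = new SqlCommand();
            cmd.Connection = cn;
            for (int i = 0; i < customerIds.Length; i++)
            {
                names[i] = "@k" + i;
                cmd.Parameters.Add(names[i], SqlDbType.Int).Value = customerIds[i];
            }
            cmd.CommandText =
                "SELECT OrderID, CustomerID, Amount FROM Orders " +
                "WHERE CustomerID IN (" + string.Join(", ", names) + ") " +
                "ORDER BY CustomerID";

            // One round trip brings back the child rows for the whole batch;
            // a DataRelation or DataTable.Select can split them out per item.
            SqlDataAdapter daOrders = new SqlDataAdapter(cmd);
            daOrders.Fill(batch, "Orders");
        }

        return batch;
    }
}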
 
Remember, though, I do not want to read in all the data at once; there is too
much to fit into memory (I cannot rely on being on a 64-bit OS and we're
probably talking about 3GB of data alone). So if I Fill everything, isn't
that going to bring everything in, not just one item at a time?

Maybe I'm not seeing the forest for the trees.

-- Tom
 
The only question I have then is how to avoid bringing in all data at once
but bring in only one "item" at a time (one row for the primary source of
input and the sets of 0..n rows for the correlated tables). If I fill
everything then we're talking about 3GB of data, potentially more, and this
will not work unless I absolutely require x64 and workstations with >> 4GB
of RAM.

-- Tom

 
Tom,

It is normal that all the queries have WHERE clauses.

You should not load the forest, only the trees that you need.

Cor
 
The Fill method uses the adapter's SelectCommand, where you are free to
specify whatever WHERE condition you see fit.
 
Resending...
Ah, I would take a different approach. I would NOT try to bring the data
to the client to process, but write a stored procedure that processes the
rowsets.

--
__________________________________________________________________________
William R. Vaughn
President and Founder Beta V Corporation
Author, Mentor, Dad, Grandpa
Microsoft MVP
(425) 556-9205 (Pacific time)
Hitchhiker's Guide to Visual Studio and SQL Server (7th Edition)
http://betav.com http://betav.com/blog/billva
__________________________________________________________________________
 
Ooh, no... the processing is way too complicated for a stored procedure. That's
a whole Can O' Worms(TM) there!! :-) We are distributing processing over a
large server farm -- bringing it back onto the SQL server would probably
kill our performance.

Regards,
-- Tom
 