Speed in ADO.NET

  • Thread starter: Manu

Manu

Hi!

I am wondering: if I have to work with a database table that has around
1,000 records, and I use a Fill method to fill a strongly typed DataSet,
then I will need a good amount of memory to store those 1,000 records
locally in the DataSet...

Phew, this can be a really resource-hungry factor in any application.

Is there any workaround?

Thanks in advance

Manu Singhal
HPI Solutions
(e-mail address removed)
 
The DataAdapter.Fill method has an overload that lets you fill a certain
number of records starting at a given record index.
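
For example (just a sketch; the connection string and the Orders table/columns here are made up, so adjust for your own schema):

using System.Data;
using System.Data.SqlClient;

class PagedFillExample
{
    // Loads at most 'pageSize' records, skipping the first 'startRecord' rows.
    static DataSet FillPage(string connString, int startRecord, int pageSize)
    {
        DataSet ds = new DataSet();
        using (SqlConnection conn = new SqlConnection(connString))
        {
            SqlDataAdapter da = new SqlDataAdapter(
                "SELECT OrderID, CustomerID, OrderDate FROM Orders", conn);

            // Fill(dataSet, startRecord, maxRecords, srcTable). Note that the whole
            // query still runs on the server; the adapter just discards the skipped
            // rows on the client, so this is client-side paging only.
            da.Fill(ds, startRecord, pageSize, "Orders");
        }
        return ds;
    }
}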
 
Manu:

Since DataTables cache the data locally, your choices are limited... it's
going to eat some RAM. 1,000 rows is definitely a good amount of data,
depending on how big the records are. 5,000 is about the most you'd ever
want to deal with, because after that the ugliness gets uglier and uglier.
Don't use SELECT * unless you need all of the columns. Normalize the data
if possible and use DataRelations where you can, because they reduce the
redundant data: http://www.knowdotnet.com/articles/datarelation.html
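
For what it's worth, here's roughly what that looks like (a sketch only; it assumes you've already filled hypothetical Customers and Orders tables that share a CustomerID column):

using System.Data;

class DataRelationExample
{
    // 'ds' is assumed to already contain filled "Customers" and "Orders" tables
    // that share a CustomerID column (hypothetical schema).
    static void LinkCustomersToOrders(DataSet ds)
    {
        DataRelation rel = ds.Relations.Add(
            "CustOrders",
            ds.Tables["Customers"].Columns["CustomerID"],
            ds.Tables["Orders"].Columns["CustomerID"]);

        // Navigate from each parent row to its child rows instead of repeating
        // the customer columns in every order row.
        foreach (DataRow customer in ds.Tables["Customers"].Rows)
        {
            DataRow[] orders = customer.GetChildRows(rel);
            // ... work with the related order rows here
        }
    }
}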

If performance is critical, then you may want to page the data, pulling over
100 records at a time (TOP 100, for instance), either by adding an Identity
field or by using a good existing key to specify 100 records at a time.
If you don't need the table to call .Update (particularly if it's
read-only data), you may want to try a DataReader.
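
A rough sketch of both ideas, assuming an Identity (or otherwise increasing) key column called ID on a made-up Items table:

using System.Data;
using System.Data.SqlClient;

class PagingAndReaderExample
{
    // Key-based paging: pull the next 100 rows after the last key we've seen.
    static DataTable NextPage(SqlConnection conn, int lastId)
    {
        SqlDataAdapter da = new SqlDataAdapter(
            "SELECT TOP 100 ID, Name FROM Items WHERE ID > @lastId ORDER BY ID", conn);
        da.SelectCommand.Parameters.Add("@lastId", SqlDbType.Int).Value = lastId;

        DataTable page = new DataTable("Items");
        da.Fill(page);   // only 100 rows cross the wire per call
        return page;
    }

    // Read-only data that never needs .Update: a DataReader avoids the
    // DataTable's in-memory caching entirely. Assumes 'conn' is already open.
    static void DumpItems(SqlConnection conn)
    {
        SqlCommand cmd = new SqlCommand("SELECT ID, Name FROM Items", conn);
        using (SqlDataReader reader = cmd.ExecuteReader())
        {
            while (reader.Read())
            {
                int id = reader.GetInt32(0);
                string name = reader.GetString(1);
                // ... process one row at a time; nothing is cached locally
            }
        }
    }
}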

That's pretty much all I can think of.

HTH,

Bill

--
W.G. Ryan MVP Windows - Embedded

http://forums.devbuzz.com
http://www.knowdotnet.com/dataaccess.html
http://www.msmvps.com/williamryan/
 
William Ryan eMVP said:
Since DataTables cache the data locally, your choices are limited... it's
going to eat some RAM. 1,000 rows is definitely a good amount of data,
depending on how big the records are. 5,000 is about the most you'd ever
want to deal with, because after that the ugliness gets uglier and uglier.

I disagree with this. We've got tables with over 5,000 records which we're
loading fully into memory on the Compact Framework, and although it takes a
while to load them, it's fine after that.

Obviously if the rows are huge it's going to take a lot of memory, but
suppose a row has 10 strings, each about 10 characters: that's 200 bytes
per row of actual data (at two bytes per character). Double it (finger-in-
the-air guesswork here) for the overhead of the references, the row object
itself, etc., and you're still only at 400 bytes per row.

400 bytes * 10,000 records is still only about 4MB, which on most systems
running .NET is peanuts (in particular, it's significantly less than
the smallest Windows Forms app's working set).

Depending on what you're wanting to do with the table, it can
definitely be faster to drag it all into memory and work with a local
copy than making repeated queries etc.
 
Hi Jon,

Yes, it depends. However, I would say that I wouldn't want to load that many
records (IOW, I agree with Ryan).
If you load that many records you'll have several problems, for example
concurrency ones - you'll end up with stale data after a while.
I would rather load the records on demand or something similar.
And again, it certainly depends on your application.
 
Hi Manu,

I totally agree with Bill and Miha.

When you are working with the DataSet, always think about the minimum amount
of data you have to load for processing the information...

It is not a Recordset, as some believe who have simply carried their old
habits over to the DataSet.

In my opinion, the DataSet's strength is the ability to take a complete but
small part of the information that needs to be processed; that way you can
transport the information back and forth quickly to almost anywhere and
update it without a lot of concurrency problems, as Miha already stated.

With that, I do not say that the small part cannot be one or more complete
tables when needed. However, in my eyes that is more reference information,
such as an article list, as might be used in the cache of a WebForms
application.

Just my thought,

Cor
 
Hi Jon:

Every time I post something that general I regret it, knowing that I may
know the nuances but, without explaining them, I may be misleading people. I
don't disagree with your point, and while that seems to contradict my
original one, let me elaborate.

1) I've seen some documentation MS put out indicating that, on average,
pulling over more than 5k rows (and obviously there is a lot of wiggle room
depending on the schema and all) is where performance degradation sets in
and pretty much starts a linear increase. My real experience has somewhat
validated this but, to be honest, the degradation isn't big enough to be
noticeable.
2) Dealing with the CF has some slight differences that I'll address in a
second, but by and large I'd still stand by the contention that pulling over
small chunks of data tends to be preferable to large ones. There are
certainly exceptions. Moreover, you could take it to the extreme and say "Is
pulling one record at a time better than pulling 100?", to which I'd say
no in most cases. Then "why is 100 better than 1000?", etc. I pulled 100
arbitrarily out of the air and it's not an absolute by any means. The point
is that unless you absolutely need 5000 records, for instance, in one
sitting, then pulling them all over may not be the best strategy. There are
a bunch of issues that come into play, concurrency being one of the bigger
ones. This doesn't directly affect performance at the time you load the app,
but it does affect things overall if you have a bunch of users that need to
engage in concurrency-violation resolutions.

On a PPC, opening a connection, particularly to a SQL CE database, can be
very costly, and you're right, once you have the data loaded, the
performance is probably faster. If you look at the new resultset, for
instance, the behavior that they built into it is more like a hybrid
between a DataReader and a DataSet, but one main precondition is an open
connection throughout its use. I bring this up because, in the current
context, if you left a connection open while you were paging, you would
reduce some overhead there. It just so happens that that savings isn't worth
the cost, so it's not a recommended approach.

Back to performance though. Let's say that the total record count was 5,000
and that the time to load it was annoyingly slow. Many factors will affect
this that may or may not affect other people in other situations, so let's
just assume that whatever the time frame is, it's unacceptably slow to fill
a DataSet. Unless you are doing some sort of batch processing or something
like that where you need all 5000, you could page (choose another number if
you don't like my 1000) so that at most you'd have 5 trips to the db. You
fill 1000 rows. You need some more, so you make another trip and add those
1000 to the first 1000. This could happen 3 more times until you've filled
everything and are in the same position as you would be if you had yanked
over all 5k at the onset. You and I are in total agreement (based on some
comments I read that you made yesterday) about not using the UI thread for
data access. Hence, you could still work on the data, perhaps filtering and
searching with what you have, while you grabbed another 1000 records. You
could possibly just schedule this so it executes one batch at a time, so you
eventually get all 5k records and you'd be able to start using rows
immediately. Or this might not be what you want, so when you hit >> or
some other UI cue, you pull over the next 'page' worth of data. Either way,
it makes a good amount of sense (again, many factors will affect this,
strengthening my argument in some cases, negating it in others) to grab only
what you need and proceed from there.
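
A sketch of that incremental load, the way I'd picture it, using a thread-pool callback and Control.BeginInvoke to get back onto the UI thread (GetNextPage here is a hypothetical helper, something like the keyed TOP-N query mentioned earlier, and error handling is left out):

using System;
using System.Data;
using System.Threading;
using System.Windows.Forms;

// Pulls 1000 rows at a time on a worker thread and merges each page into the
// DataSet on the UI thread. Assumes only one page load is in flight at a time.
class IncrementalLoader
{
    DataSet ds;
    Control grid;      // some control bound to the data
    int lastKey = 0;   // highest key value merged so far

    public IncrementalLoader(DataSet ds, Control grid)
    {
        this.ds = ds;
        this.grid = grid;
    }

    // Kick off the next page without blocking the UI thread.
    public void LoadNextPageAsync()
    {
        ThreadPool.QueueUserWorkItem(new WaitCallback(LoadPage));
    }

    void LoadPage(object state)
    {
        DataTable page = GetNextPage(lastKey);   // runs on a worker thread

        // Marshal back to the UI thread before touching bound data.
        grid.BeginInvoke(new MergeHandler(MergePage), new object[] { page });
    }

    delegate void MergeHandler(DataTable page);

    void MergePage(DataTable page)
    {
        ds.Merge(page);                          // safe: we're on the UI thread now
        if (page.Rows.Count > 0)
            lastKey = (int)page.Rows[page.Rows.Count - 1]["ID"];
    }

    DataTable GetNextPage(int afterKey)
    {
        // Hypothetical data-access call; see the keyed TOP-N query sketched earlier.
        throw new NotImplementedException();
    }
}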

Put another way, let's say that in my use case I may or may not need all
5,000, but at the onset I definitely need, say, 250. So I grab 300 of them.
Now I do my thing, and then in this instance I get a call, have to head out
to see a client, so I just submit my changes. If I pulled over more data,
say all 5000, then any overhead associated with the difference is wasted.
This may not be a big deal or it might be more significant; again, many
factors come into play here. If I knew for sure I needed all 5,000 and I
knew the server was going to be down in 10 minutes, then my 100-at-a-time
approach would clearly suck, because anything I didn't get before the server
went down I'd be without.

As such, I really believe there's NO best approach for all scenarios. In
general, I think that leaning toward smaller DataSet sizes tends to
perform better, but even if I were provably correct, this would make little
difference if one's scenario is such that connections are very expensive
(either because they don't always have network connectivity, or the server
is bogged down, or whatever). I think it really depends on the situation,
but by and large I'd err on the side of minimalism because, on the whole, I
think it's cheaper overall.

So this is the inherent danger in making sweeping generalizations as I did,
even though I meant it as an 'in general' approach.

Again, thanks for your feedback. Discussions with Jon and Miha are always
thought-provoking, and you both always have some really good points (often
much better than mine ;-)).

Cheers,

Bill

--

W.G. Ryan, eMVP

http://forums.devbuzz.com/
http://www.knowdotnet.com/williamryan.html
http://www.msmvps.com/WilliamRyan/
 
Hi Bill,

See this link I got from Jay B. today
http://msdn.microsoft.com/architecture/default.aspx?pull=/library/en-us/dnpag/html/scalenet.asp

Everything about this is in there, except that I struggle with the
data-layer part where it talks about caching the data on the client.

That is, in my opinion, against what Miha said, which I agree with.

Cor



William Ryan eMVP said:
<snip>
 
Excellent link, Cor, as always! Thanks. Yep, I agree with Miha too. Whenever
I find myself disagreeing with him I know I'm in trouble, although that
doesn't happen. Jon is the man on performance, though, and I pay a lot of
attention to what he says on the subject. In this instance I don't see the
numbers adding up overall, but he's right, there are cases where loading it
all at once makes sense. I got the impression he thinks this is the norm
rather than the exception; I think it's the other way around. And when you
factor in what you and Miha have both mentioned, it doesn't take too many
concurrency issues to offset the small performance gains you'd see in cases
where Jon's scenario is in effect.

Also, I didn't see your post when I wrote back and as such, I'd extend the
compliment to you too...you always have some insightful comments (and
usually some good links) and I learn quite a bit from all of you guys ;-)

Thanks again,

Bill

--

W.G. Ryan, eMVP

http://forums.devbuzz.com/
http://www.knowdotnet.com/williamryan.html
http://www.msmvps.com/WilliamRyan/
http://www.devbuzz.com/content/zinc_personal_media_center_pg1.asp
Cor Ligthert said:
<snip>
Jon:

You mention something that I'd like to follow up on. I'm in no way
disagreeing with your point about "finger in the air, double it for
references". One of my coworkers has spotted some really weird behavior
(i.e. degraded performance) when passing DataTables around, and I suggested
not passing it in a few instances, but rather just returning it. He'd pass a
DataTable by reference, fill it in a class and then use it in the caller.
In SqlClient it wasn't a big deal, but he used the same methodology in OleDb
and it was slow as dirt. I pointed out that DataTables are reference types,
so there's no need to pass one by reference, and by not passing a DataTable
at all, but returning a filled one, performance increased dramatically. This
is what I expected from the onset. However, he did some refactoring, and
passing a DataTable to a local method repeatedly killed the performance
again. We suggested just passing the rows, and again this fixed it. He's
also spotted some other problems and he's averse to passing DataTables at
all now, particularly in the refactored code. I am of the impression that
usage, more than anything else, is what caused the hit.
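
For what it's worth, the shape I suggested looks something like this (just a sketch with a made-up query; since DataTable is a reference type, returning it only hands back a reference, so nothing about the data itself gets copied either way):

using System.Data;
using System.Data.SqlClient;

class TableFactory
{
    // Returning the filled table: the caller just receives a reference,
    // no ByRef/ref parameter needed, and the table's data is never copied.
    public static DataTable GetCustomers(string connString)
    {
        DataTable table = new DataTable("Customers");
        using (SqlConnection conn = new SqlConnection(connString))
        {
            SqlDataAdapter da = new SqlDataAdapter(
                "SELECT CustomerID, CompanyName FROM Customers", conn);
            da.Fill(table);
        }
        return table;
    }
}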

Anyway, how would one go about checking the memory used by a DataTable
(other than checking beforehand what's used, seeing what's used afterward,
and using the difference as an estimate)? Specifically, is there a way you
could accurately say in advance what a DataTable with X schema would use if
it occupied Y rows? I've tried using your method below, but the references
throw me.

Any ideas on how to performance-test DataTables and passing them around?
Thanks!

Bill

--

W.G. Ryan, eMVP

http://forums.devbuzz.com/
http://www.knowdotnet.com/williamryan.html
http://www.msmvps.com/WilliamRyan/
http://www.devbuzz.com/content/zinc_personal_media_center_pg1.asp
 
William Ryan eMVP said:
Every time I post something that general I regret it, knowing that I may
know the nuances but, without explaining them, I may be misleading people. I
don't disagree with your point, and while that seems to contradict my
original one, let me elaborate.

<snip>

Nice elaboration - and I don't disagree with any of it.

It really depends on exactly what you're doing on the data, and as you
say there's no overall best approach.
 
William Ryan eMVP said:
Excellent link, Cor, as always! Thanks. Yep, I agree with Miha too. Whenever
I find myself disagreeing with him I know I'm in trouble, although that
doesn't happen. Jon is the man on performance, though, and I pay a lot of
attention to what he says on the subject.

Please don't! Or at least, don't trust my ideas on performance when it
comes to databases too far. I have some experience on various aspects
of performance, but I'm *far* from a database expert (a very, *very*
long way away from it!).

Great discussion :)
 
William Ryan eMVP said:
You mention something that I'd like to follow up on. I'm in no way
disagreeing with your point about "finger in the air, double it for
references". One of my coworkers has spotted some really weird behavior
(i.e. degraded performance) when passing DataTables around, and I suggested
not passing it in a few instances, but rather just returning it. He'd pass a
DataTable by reference, fill it in a class and then use it in the caller.
Right.

In SqlClient it wasn't a big deal, but he used the same methodology in OleDb
and it was slow as dirt. I pointed out that DataTables are reference types,
so there's no need to pass one by reference, and by not passing a DataTable
at all, but returning a filled one, performance increased dramatically. This
is what I expected from the onset. However, he did some refactoring, and
passing a DataTable to a local method repeatedly killed the performance
again. We suggested just passing the rows, and again this fixed it. He's
also spotted some other problems and he's averse to passing DataTables at
all now, particularly in the refactored code. I am of the impression that
usage, more than anything else, is what caused the hit.

That sounds very strange indeed. Passing a reference to a DataTable
*shouldn't* have any performance hits that I'm aware of... Is it worth
us trying to isolate this and work out what's going on?
Anyway, how would one go about checking the memory used by a DataTable
(other than checking beforehand what's used, seeing what's used afterward,
and using the difference as an estimate)?

I'm afraid the latter is *exactly* what I'd do!
Specifically, is there a way you could accurately say in advance what a
DataTable with X schema would use if it occupied Y rows? I've tried using
your method below, but the references throw me.

Unfortunately I suspect it depends on what you do with it. For
instance, as we've seen recently, calling Select seems to perform some
caching - so the memory usage changes. Goodness knows what else makes a
difference. For instance, loading the schema with the key may well take
more memory than loading the schema without the key...
Any ideas on how to performance-test DataTables and passing them around?

Only the normal way - keep everything simple (i.e. avoid GUIs!), do
large measurements to keep accuracy reasonable, etc.
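
The crude before-and-after check would look something like this (a sketch; GetTable stands in for whatever actually builds the table, and GC.GetTotalMemory(true) forces a collection so dead objects don't skew the numbers):

using System;
using System.Data;

class MemoryProbe
{
    // Rough estimate of the memory a filled DataTable holds: measure total
    // managed memory before and after building it, forcing a collection
    // each time.
    static long MeasureTableBytes()
    {
        long before = GC.GetTotalMemory(true);

        DataTable table = GetTable();       // hypothetical fill routine

        long after = GC.GetTotalMemory(true);
        GC.KeepAlive(table);                // keep the table alive past the second measurement
        return after - before;
    }

    static DataTable GetTable()
    {
        // Stand-in for the real fill logic.
        throw new NotImplementedException();
    }
}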
 
Thanks, Jon. I'm definitely going to look into it, but I think I've figured
out what I'm after ultimately.

I have a DataTable which is composed of x DataRows and y DataColumns. So, if
the CLR simply manages a reference to the DataTable itself, there's one
reference. However, I don't think that could ever be the whole story,
because something needs to manage the rows and columns. So if I have a table
of 10 columns and 100 rows, does the same structured table with 200 rows
consume twice the resources in terms of references that have to be tracked?
Obviously it will in the sense of the data that's stored, but I'm wondering
if it manages twice the number of references, or how it works. This seems
like precisely the type of thing where the process of the investigation will
yield more insights than the 'answer' will per se. I'll let you know what I
find out.

Thanks again,

Bill

--

W.G. Ryan, eMVP

http://forums.devbuzz.com/
http://www.knowdotnet.com/williamryan.html
http://www.msmvps.com/WilliamRyan/
http://www.devbuzz.com/content/zinc_personal_media_center_pg1.asp
 
William Ryan eMVP said:
Thanks, Jon. I'm definitely going to look into it, but I think I've figured
out what I'm after ultimately.

<snip>

I would expect it to have something like an ArrayList of rows, to be
honest - which will have some overhead for null references to rows
which don't exist, of course, but that should be relatively low.
 