Hi Jon:
Every time I post something that general I regret it; I may know the nuances,
but without explaining them I risk misleading people. I don't disagree with
your point, and while that may seem to contradict my original one, let me
elaborate.
1) I've seen some documentation MS put out indicating that, on average,
pulling over more than 5k rows (and obviously there's a lot of wiggle room
depending on the schema and all) is where performance degradation sets in and
starts to increase roughly linearly. My real-world experience somewhat
validates this, but to be honest, the degradation hasn't been big enough to be
noticeable.
2) Dealing with the CF has some slight differences that I'll address in a
second, but by and large I'd still stand by the contention that pulling over
small chunks of data tends to be preferable to pulling over large ones. There
are certainly exceptions. Moreover, you could take it to the extreme and ask
"Is pulling one record over at a time better than pulling 100?" to which I'd
say no in most cases. Then "why is 100 better than 1000?" and so on. I pulled
100 out of the air arbitrarily and it's not an absolute by any means. The
point is that unless you absolutely need, say, 5,000 records in one sitting,
pulling them all over may not be the best strategy. There are a bunch of
issues that come into play, concurrency being one of the bigger ones. This
doesn't directly affect performance at the time you load the app, but it does
affect things overall if you have a bunch of users that need to engage in
concurrency violation resolutions (a quick sketch of that resolution step
follows below).
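To illustrate what I mean by that last part, here's a minimal sketch of the
resolution step when you push changes back through a data adapter with the
usual optimistic-concurrency setup. The adapter wiring, the "Orders" table,
and the helper name are all hypothetical; the point is just that the more rows
each user has cached, and the longer they hold them, the more often this
catch block gets hit:

using System;
using System.Data;
using System.Data.SqlClient;

// Hypothetical helper: 'adapter' is assumed to already have Update/Insert/
// Delete commands whose WHERE clauses check the original values (the usual
// optimistic-concurrency setup from a CommandBuilder or the designer).
public static class ChangeSubmitter
{
    public static void Submit(SqlDataAdapter adapter, DataSet data)
    {
        try
        {
            adapter.Update(data, "Orders");
        }
        catch (DBConcurrencyException ex)
        {
            // Someone else modified this row after we pulled it over. The
            // offending row is in ex.Row; typical options are re-fetch and
            // merge, overwrite, or ask the user -- that's the "resolution"
            // work I'm referring to.
            Console.WriteLine("Conflict on row: " + ex.Row[0]);
        }
    }
}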
On a PPC, opening a connection, particularly to a SQL CE database, can be very
costly, and you're right: once you have the data loaded, performance is
probably faster. If you look at the new resultset (SqlCeResultSet) for
instance, the behavior they built into it is more like a hybrid between a
DataReader and a DataSet, but one main precondition is an open connection
throughout its use. I bring this up b/c in the current context, if you left a
connection open while you were paging, you would reduce some of the overhead
there. It just so happens that that savings isn't worth the cost, so it's not
a recommended approach.
Back to performance, though. Let's say the total record count was 5,000 and
the time to load it was annoyingly slow. Many factors will affect this that
may or may not affect other people in other situations, so let's just assume
that whatever the time frame is, it's unacceptably slow to fill a DataSet.
Unless you're doing some sort of batch processing or something like that where
you need all 5,000, you could choose another number if you don't like my
1,000, so that at most you'd have 5 trips to the db. You fill 1,000 rows. You
need some more, so you make another trip and add those 1,000 to the first
1,000. This could happen 3 more times until you've filled everything and
you're in the same position you'd be in if you'd yanked over all 5k at the
onset. You and I are in total agreement (based on some comments I read that
you made yesterday) about not using the UI thread for data access. Hence, you
could still work on the data, perhaps filtering and searching what you have,
while you grabbed another 1,000 records. You could possibly just schedule this
so it executes one chunk at a time, so you eventually get all 5k records and
you'd be able to start using rows immediately. Or this might not be what you
want, so when you hit >> or some other UI cue, you pull over the next 'page'
worth of data. Either way, it makes a good amount of sense (again, many
factors will affect this, strengthening my argument in some cases, negating it
in others) to grab only what you need and proceed from there.
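To make that concrete, here's a minimal sketch of the idea, assuming a
SqlCeDataAdapter on CF 2.0. The names (LoadNextPage, the page size, the Orders
table, the database path) are mine, not anything prescriptive; the fill
happens off the UI thread and the result is marshalled back via
Control.Invoke:

using System;
using System.Data;
using System.Data.SqlServerCe;
using System.Threading;
using System.Windows.Forms;

public class OrdersForm : Form
{
    private const int PageSize = 1000;
    private int _nextStart = 0;
    private DataSet _data = new DataSet();

    // Wire this to a ">>" button, or call it repeatedly to trickle in
    // the whole table one chunk at a time.
    private void LoadNextPage()
    {
        ThreadPool.QueueUserWorkItem(delegate
        {
            using (SqlCeConnection conn = new SqlCeConnection(
                @"Data Source=\My Documents\App.sdf"))
            using (SqlCeDataAdapter da = new SqlCeDataAdapter(
                "SELECT OrderID, CustomerID, OrderDate FROM Orders " +
                "ORDER BY OrderID", conn))
            {
                // Fill only the next chunk (Fill opens/closes the connection
                // itself). Note this overload still reads past the earlier
                // rows, so for very large tables a keyed WHERE clause
                // (e.g. OrderID > last one seen) is the cheaper way to page.
                da.Fill(_data, _nextStart, PageSize, "Orders");
                _nextStart += PageSize;
            }

            // Marshal back to the UI thread before touching any controls.
            this.Invoke(new EventHandler(OnPageLoaded));
        });
    }

    private void OnPageLoaded(object sender, EventArgs e)
    {
        // Rebind or refresh the grid with the rows accumulated so far, e.g.
        // dataGrid1.DataSource = _data.Tables["Orders"];
    }
}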
Put another way, let's say that in my use case I may or may not need all
5,000, but at the onset I definitely need, say, 250. So I grab 300 of them.
Now, I do my thing, and then in this instance I get a call, have to head out
to see a client, and just submit my changes. If I'd pulled over more data, say
all 5,000, then any overhead associated with the difference is wasted. This
may not be a big deal or it might be more significant; again, many factors
come into play here. On the other hand, if I knew for sure I needed all 5,000
and knew the server was going to be down in 10 minutes, then my 100-at-a-time
approach would clearly suck b/c anything I didn't get before the server went
down, I'd be without.
As such, I really believe there's NO best approach for all scenarios. In
general, I think that leaning toward smaller DataSet sizes tends to perform
better, but even if I were provably correct, this would make little difference
if one's scenario is such that connections are very expensive (either b/c they
don't always have network connectivity, or the server is bogged down, or
whatever). I think it really depends on the situation, but by and large I'd
err on the side of minimalism b/c on the whole I think it's cheaper overall.
So that's the inherent danger in making sweeping generalizations like the one
I made, even though I meant it as an 'in general' approach.
Again, thanks for your feedback. Discussions with you and Miha are always
thought-provoking, and you both always have some really good points (often
much better than mine ;-)).
Cheers,
Bill
--
W.G. Ryan, eMVP
http://forums.devbuzz.com/
http://www.knowdotnet.com/williamryan.html
http://www.msmvps.com/WilliamRyan/