dataset Performence Issue

  • Thread starter Thread starter Guest
  • Start date Start date
G

Guest

We've heard reports that the Dataset object in the .NET 1.1 frameworks tends
to slow down dramatically after it reaches a certain size, around 70MB or so,
regardless of the amount of RAM installed on the machine. We haven't
experienced this ourselves yet, but can see the day when our Dataset will
comprise of hundreds of megabytes of data.



Questions:



1 Is this a well-known issue?

2 Is this a general performance problem, or just a problem with certain
methods? The Merge method comes to mind, which we use quite a bit in our
code indirectly.

3 Is there any way to work around this limitation?

4 Does the NET 2.0 framework address this issue?
 
If your DataSet will occupy 100's of MB, seriously that is ... ugghh .. it
makes me feel so pukish .. sorry but just don't do that. Such a big dataset
is like blasphemy, it's so awful, I am just so disgusted to even hear that.
Why do you need such a big dataset? That is an AWFUL way to use a DataSet.

In .NET 1.1, given the size of the dataset of course everything will croak.
Yes Especially GetChanges and Merge, both will be awful - especially if you
have lots of relations and a lot of tables.

The workaround is - "A DATASET IS NOT A DATABASE, DON'T ABUSE IT AS ONE"

..NET 2.0 has many enhancements under the scenes in both the various
underlying collections and the GetChanges algorithm and otherwise that make
using a DataSet a lot better, but 70MB? 100MB in a Dataset (In-Memory
Disconnected Cache?) - That is SUPER DUPER BUMPER ULTRA AWFUL.

BTW, especially YOU need to read Chapters 9 and 10 of my book to understand
clearly and practically why I am advocating so strongly against such a
misuse/ABUSE of a DataSet.

- Sahil Malik [MVP]
ADO.NET 2.0 book -
http://codebetter.com/blogs/sahil.malik/archive/2005/05/13/63199.aspx
 
Sahil,

There's nothing wrong with having a large, in-memory database. There are
lots of reasons in a real-time environment where a database can fit in
memory and performance rules. The SCADA system I work with has a 400 MB
in-memory database. Data is flushed to disk, once an hour or on-demand.

In-memory databases are also typically used in the embedded market since I/O
to flash memory is usually not great.

I see Datasets as a good solution to some problems.

I see other open-source databases like Firebird, SharpSQL and SQLLite as
good solutions when a small footprint RDBMS is needed. They are simple to
install and setup.

MSDE / SQL-Server Express is still overkill for many database needs, but
does provide a great solution to problems that match it capabilities.

I look foward to reading you book when its available.
Ad.
 
Adrian,

I agree with you that there can be many reasons to have a large in memory
database.
By instance with a single user used database in situations where speed is
very important.

However that is not the reason a dataset is build for. A dataset is build
for multipurpose use in a multi-user disconnected situation and has
therefore all kinds of features.

By instance in a single used database it is not of any importance if a row
is changed. After copying the previous one in process time, do you dump the
whole database over the offline stored data

Therefore let us in this performance issue not mix up apples with computers.

Cor
 
But what I am certainly not a fan of is storing 400 MB in a DataSet - that
object is just not designed for such a heavy amount of in memory data.
Sahil,

You make that assertion - can you elaborate as to why you believe that? I
don't see anything in the documentation, or the services that DataSets
provide, that would make me believe that they are not designed to handle
large amounts of data (given the inherent constraints of memory, garbage
collecting very large heaps, etc).

Now it appears to be the case that the V1.1 implementation has some
performance problems in some areas with large tables, but at least some of
these are fixed in V2.0. So Microsoft are making efforts to ensure that
large volumns of data are supported. See the following article for some
information:

http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dnadonet/html/datasetenhance.asp

While I would agree that the main use and focus for DataSets is *selective*
caching of data, I can certainly envisage situations where I might want to
hold quite a large extract of a table in a dataset (where connectivity is
not always available, for instance).
 
Sahil,

I expect Datasets will also continue to evolve in order to meet the feature
and performance needs that developers are demanding.

Ad.
 
I have found answer for my queries i would like to share this with you


Question
=======
“We've heard reports that the Dataset object in the .NET 1.1 frameworks tend
to slow down dramatically after it reaches a certain size, around 70MB or so,
regardless of the amount of RAM installed on the machine. We haven't
experienced this ourselves yet, but can see the day when our Dataset will
comprise of hundreds of megabytes of data. “



Please find answers to your queries below
============================

1 Is this a well-known issue?


Yes for v1.x, it’s a well known issue.



2 Is this a general performance problem, or just a problem with certain
methods? The Merge method comes to mind, which we use quite a bit in our
code indirectly.



For v1.x, this is a general performance problem – it cuts across all write
actions i.e. all methods that modify DataRows – including Merge.

The performance problem is mitigated to some extent for bulk load operations
which include Merge, Adapter.Fill and Table.AcceptChanges(). The impact is
most visible for singleton random inserts, deletes and modifications.



3 Is there any way to work around this limitation?



Ways to work around this limitation?

1. Try to minimize constraints on columns. Constrains include
column.AllowDbNull, ForeignKey, PrimaryKey, UniqueKey, etc.

2. Try to minimize the use of relations as it involved building indexes on
the participating columns of the tables.



4 Does the NET 2.0 framework address this issue?

Yes, out of the box – no changes required in user code. The benefits cut all
across Dataset.



To give you some perspective on the change, 70 MB is at most 500K rows, and
in v2.0 you can fill 1 million rows, inserted randomly with a primary key in
around 30 seconds. To insert the same 1 million rows in v1.x, it takes 30
minutes.


Thanks for your support
 
Nigel,

It is true that .NET 2.0's dataset has been really improved from within to
handle greater quantities of data, but the justification of that is the
desire to abuse it as a database. It shouldn't be interpreted as a nod from
Microsoft that a DataSet is okay to abuse as a DataBase. (I am not Microsoft
mind you).

There are a number of reasons for this.

1. First of all, Dataset is guess what - Managed Code, and to send all the
..NET lovers in tailspin, Managed Code can never be as fast and as optimized
as native code. That is reasonable to expect and I am certainly not saying
that unmanaged is better and drop managed code for good - NO WAY, there are
a lot of benefits of using Managed code, but when it comes to raw
performance, Managed code sucks.

2. Secondly, the Garbage collector is that animal that makes things very
very good for 90% of the situations i.e. normal memory usage, but when you
start storing many megabytes or close to a gigabyte of information
completely in RAM - it will actually hurt your application performance. In
those scenarios, you don't want an external policeman who doesn't understand
the specific needs of your app. In that situation, you want fine control on
the memory where you specifiy when it gets cleaned, or serialized to the
disk etc. You need paging mechanisms etc. which are possible to write for
the dataset but are a real royal pain to write and even then they don't work
quite as well as - guess what - native code (i.e. most of what SQL Server).

3. SQL Server and any database comes with a "Query Engine". The number of
optimizations built into that is the work of many Phds (or dudes with
similar smarts and specialization), they have written up SQL Server's query
engine to take advantage of automatic paging, locking algorithms, spilling
over to the disk when needed, "query plans", caching those query plans -
when you compare the object model of a Dataset (or any biz object for that
matter), the comparison is like comparing a candle with the sun.

4. The algorithms in a DataSet is rudimentary, they rely on simple
techniques such as string matching, string manipulation - that level of
simplicity. They work on an "Object structure", every value they access goes
over a dereferenced segment calculation. SQL Server and any standard
database is written with that level of optimization (Not like I have looked
at their code, but it has to be written that way). Maybe in certain
instances DataSet might be smarter than I try to bring out here, but a
DataSet is an in memory object - that is how simple it is. It's just an
"object". It's worse than MS-Access when it comes to managing data for you.
(Don't get me wrong, I still love datasets, I just don't think they replace
databases). Even MS-Access will manage 4 GB, a DataSet doesn't even have a
clear upper limit definied. But of course MS Access isn't an in memory, xml
convertible, serializable in memory cache that lets you extract and merge
changes - so comparing dataset with access is like comparing apples and
oranges (but hey the argument is all about a DataSet is an Orange (in memory
cache), not an Apple (Database)).

5. Lets not forget transactional locks and many other such points, I blogged
about it earlier over here -
http://codebetter.com/blogs/sahil.malik/archive/2005/01/23/47547.aspx

6. Datasets are or any such object - AN IN MEMORY disconnected cache of
data. Being completely in memory lends them to the disadvantage of a 32 bit
OS's 2 GB memory allocation limit, there are ways around that but I
personally see those ways as bandaids rather than a true solution. Secondly
being disconnected leaves you with a WORLD of problems to solve when trying
to persist relational data back into the database. If you think
DataAdapter.Update(Dataset) will save your entire dataset into the database,
and also take care of concurrency issues, and transactional deadlocks - you
are sorely mistaken. A simple 3 table hierarchy will require you to write
pages and pages of code to save properly into the database in every
scenario - it is NOT a trivial task. And then you have to worry about not
sending too much over the wire in web service like environments, so
Merge/GetChanges - and oh lets not forget keeping your disconnected cache
fresh. How do you resolve Deleted Rows? .NET 1.1 leaves you with very few
choices, .NET 2.0 has a new IndexOf method which doesn't work in every
circumstance.

Now of course you could argue that #6 proves that simply don't use a
database, only use a disconnected cache i.e. prevayler etc. For that
argument, re-read items #1 thru 5.

In short, while DataSets will continue to improve, or lets say, they will
continue to try and be as good as a full fledged database is, they will
NEVER reach that point. While even though datasets will improve, who knows
you may even be able to run a SQL Query against them directly, but hey you
could always store a little tiny such database on a RAMDisk, or a memory
mapped file in SQL Server, and pretty much get what you need including the
heavy duty research that has already gone in making "a database".

Again, I strongly and vehemently disagree with an architecture that puts 1
GB data into a DataSet. That is complete stupidity in both .NET 1.1 and 2.0.

Whew .. this was a long reply .. gotta go !!

- Sahil Malik [MVP]
ADO.NET 2.0 book -
http://codebetter.com/blogs/sahil.malik/archive/2005/05/13/63199.aspx
 
V1.x DataSet not storing megabytes of data, IS NOT AN ISSUE, it is WHAT
YOU'D EXPECT out of a completely disconnected object that stores an
in-memory representation of tabular data.

Your bicycle doesn't fly, is that an issue with the bicycle?? Don't you
think the problem is between the keyboard and the chair? Don't you think you
should have instead bought an airplane instead rather than expecting your
bicycle to fly?

Sorry but you are SORELY mistaken.

But do share the source of your information below. I am curious how did they
pass a judgement on the exact numbers you mention - 500K rows and 1 million
rows.

- Sahil Malik [MVP]
ADO.NET 2.0 book -
http://codebetter.com/blogs/sahil.malik/archive/2005/05/13/63199.aspx
 
Sahil,

An interesting and well argued post. However I fear you are setting up a
Straw Man here. I never suggested that a DataSet is an alternative to a
DBMS. All I'm suggesting is that the OP'S use of a 70Mb DataSet is not
necessarily 'awful'. Indeed there may be cases where that's a very good
thing to do.

Clearly an in-memory approach has inherent scaling limits - the virtual and
physical memory on the machine. But 70Mb is a perfectly reasonable amount
of data to hold in memory in some applications. A suitable structured
in-memory solution tailored to specific queries can outperform a DBMS.
Network and disk costs see to that.

More comments in-line.

Nigel

1. First of all, Dataset is guess what - Managed Code, and to send all the
.NET lovers in tailspin, Managed Code can never be as fast and as
optimized as native code.
Ok, Managed Code has a cost. But DBMS's require network and disk operations,
which have a way bigger cost. You have to look at all the costs before you
can conclude that one approach is faster than the other.
2. Secondly, the Garbage collector is that animal that makes things very
very good for 90% of the situations i.e. normal memory usage, but when you
start storing many megabytes or close to a gigabyte of information
completely in RAM - it will actually hurt your application performance. In
those scenarios, you don't want an external policeman who doesn't
understand the specific needs of your app. In that situation, you want
fine control on the memory where you specifiy when it gets cleaned, or
serialized to the disk etc. You need paging mechanisms etc. which are
possible to write for the dataset but are a real royal pain to write and
even then they don't work quite as well as - guess what - native code
(i.e. most of what SQL Server).
My data is going to fit in VM - I would never suggest writing any form of
paging mechansism. Then you really are re-inventing a DBMS. The only paging
mechanism I'm relying on is the OS paging, which is - guess what - native
code that has been optimized to the max. In any case I'd expect to have
enough physical memory not to need paging.
3. SQL Server and any database comes with a "Query Engine". The number of
optimizations built into that is the work of many Phds (or dudes with
similar smarts and specialization), they have written up SQL Server's
query engine to take advantage of automatic paging, locking algorithms,
spilling over to the disk when needed, "query plans", caching those query
plans - when you compare the object model of a Dataset (or any biz object
for that matter), the comparison is like comparing a candle with the sun.
And it needs all that complexity because it's trying to minimize disk IO,
and because it's got to service every kind of request thrown at it. An
in-memory DataSet used for a specific purpose doesn't need any of that to
achieve the same performance.
4. The algorithms in a DataSet is rudimentary, they rely on simple
techniques such as string matching, string manipulation - that level of
simplicity. They work on an "Object structure", every value they access
goes over a dereferenced segment calculation.

DataSets can maintain indexes, as far as I understand. I presume an index
will use a standard Hashtable - which is a reasonably optimized lookup
algorithm. So for a single indexed column lookup performance should be very
good.
5. Lets not forget transactional locks and many other such points, I
blogged about it earlier over here -
http://codebetter.com/blogs/sahil.malik/archive/2005/01/23/47547.aspx

Don't need any of that stuff. In-memory is fast - if you really need writes
then just serialize everything with a reader/writer lock.
6. Datasets are or any such object - AN IN MEMORY disconnected cache of
data. Being completely in memory lends them to the disadvantage of a 32
bit OS's 2 GB memory allocation limit,

Yep - if you have, or may need, 2 Gb of data, use a DBMS. Or use a 64-bit
machine.
Again, I strongly and vehemently disagree with an architecture that puts 1
GB data into a DataSet. That is complete stupidity in both .NET 1.1 and
2.0.
As part of your straw man argument, you've escalated the OP's 70 Mb a bit!
 
Managed Code can never be as fast and as optimized as native code.

Why not?

I have the idea about the opposite.

I make it short otherwise you probably don't understand the question.

Cor
 
Miha,

Beside that gives managed code you automatic features that native does not.

It is for me the same discussion as that machine code would be faster than
program language code.

This is true for very small programs, however as soon as it becomes large,
than the amount of code that could have been reused as done by a program
language (and is almost impossible to do by hand), is so much that the
program language wins it almost forever.

Just my thought,

Cor
 
Maybe in very specific situations it can be. Maybe in future the balance
will tip, but in general native code is faster, don't you think?

- Sahil Malik [MVP]
ADO.NET 2.0 book -
http://codebetter.com/blogs/sahil.malik/archive/2005/05/13/63199.aspx
----------------------------------------------------------------------------


Miha Markic said:
Actually, managed code can be faster than native code due to CPU specific
optimisations.
Perhaps now there aren't many optiomizations compiler does, however in
future we should se more of them.

--
Miha Markic [MVP C#] - RightHand .NET consulting & development
www.rthand.com
Blog: http://cs.rthand.com/blogs/blog_with_righthand/
SLODUG - Slovene Developer Users Group www.codezone-si.info

Cor Ligthert said:
Why not?

I have the idea about the opposite.

I make it short otherwise you probably don't understand the question.

Cor
 
Hi Sahil,

I am not sure which is faster and when. It depends on so many factors (even
on the CPU type) and it is really hard to compare avoiding "comparing apples
and oranges". Generally speaking I don't think that managed code is slower
than native or at least, significantly slower afterall managed code is
executed as native code...

--
Miha Markic [MVP C#] - RightHand .NET consulting & development
www.rthand.com
Blog: http://cs.rthand.com/blogs/blog_with_righthand/
SLODUG - Slovene Developer Users Group www.codezone-si.info

Sahil Malik said:
Maybe in very specific situations it can be. Maybe in future the balance
will tip, but in general native code is faster, don't you think?

- Sahil Malik [MVP]
ADO.NET 2.0 book -
http://codebetter.com/blogs/sahil.malik/archive/2005/05/13/63199.aspx
----------------------------------------------------------------------------


Miha Markic said:
Actually, managed code can be faster than native code due to CPU specific
optimisations.
Perhaps now there aren't many optiomizations compiler does, however in
future we should se more of them.

--
Miha Markic [MVP C#] - RightHand .NET consulting & development
www.rthand.com
Blog: http://cs.rthand.com/blogs/blog_with_righthand/
SLODUG - Slovene Developer Users Group www.codezone-si.info

Cor Ligthert said:
Managed Code can never be as fast and as optimized as native code.

Why not?

I have the idea about the opposite.

I make it short otherwise you probably don't understand the question.

Cor
 
Just to get some facts instead of opinions, I ran some tests. The test
populated a DataTable with varying numbers of records up to 1 million.

The table was simple - 5 string columns, with one of those columns as the
primary key. Around about 100 bytes of data per row. The kind of application
I was thinking of was an address lookup table.

The two numbers I was interested in were:
- the load time per row
- the time to retrieve a row by primary key

Here's a few sample numbers (times in milliseconds):

Count Load Retrieve
1,000 0.02 0.009
10,000 0.03 0.013
100,000 0.04 0.018
1,000,000 0.06 0.028

CPU is an Athlon 2200. Framework version is V1.1

Basically the load cost per row does rise, but not too bad. However this is
a very simple table. I will interested to try this with the V2.0
improvements.

Retrieval time also rises, though quite slowly. Still, 30 microseconds per
lookup is probably better than you'll get out of a database.

So a reasonably large DataSet can work. However I would always be sure to
measure the behaviour with the largest amount of data that I wanted to be
able to handle before committing to it.
 
Sahil said:
Maybe in very specific situations it can be. Maybe in future the
balance will tip, but in general native code is faster, don't you
think?

Managed code has the overhead of the CLR and JIT which consume cpu
time. native code doesn't have that. On the other hand, native code
lacks information about code usage at runtime. This means that it has
to perform compile-time optimization and hope for the best at runtime.
In theory, you can argue that a JIT can produce faster code at runtime
because it knows the code usage patterns and can therefore decide some
code path is faster, based on information available only at runtime.
native code doesn't have that, which can lead to slower code.

That is, in theory. Deciding what's faster eats time too, and the more
time spend on deciding what's faster, the less time is gained.

Another optimization trick for managed code is that it performs code
analysis at compile time like native compilers do, and instead of
optimizing the code in full, it places JIT hints into the IL, so at
runtime the JIT will be faster in deciding what to do, as it knows up
front what lays ahead (because that's already analysed at compile
time). I believe sun's hotspot system uses these.

MS research has said that within a few years time, managed code is
faster than native code. That is to be seen though. A lot is needed in
the JIT department to get everything as fast as it should be.

FB
Miha Markic said:
Actually, managed code can be faster than native code due to CPU
specific optimisations.
Perhaps now there aren't many optiomizations compiler does, however
in future we should se more of them.

--
Miha Markic [MVP C#] - RightHand .NET consulting & development
www.rthand.com
Blog: http://cs.rthand.com/blogs/blog_with_righthand/
SLODUG - Slovene Developer Users Group www.codezone-si.info

Cor Ligthert said:
Managed Code can never be as fast and as optimized as native code.

Why not?

I have the idea about the opposite.

I make it short otherwise you probably don't understand the
question.


--
 
Thats it !! Frans has argued in my favor, all other arguments laid to rest
:-)
I agree man, Native Code is faster than Managed in most circumstances
(atleast today).

- Sahil Malik [MVP]
ADO.NET 2.0 book -
http://codebetter.com/blogs/sahil.malik/archive/2005/05/13/63199.aspx
----------------------------------------------------------------------------

Frans Bouma said:
Sahil said:
Maybe in very specific situations it can be. Maybe in future the
balance will tip, but in general native code is faster, don't you
think?

Managed code has the overhead of the CLR and JIT which consume cpu
time. native code doesn't have that. On the other hand, native code
lacks information about code usage at runtime. This means that it has
to perform compile-time optimization and hope for the best at runtime.
In theory, you can argue that a JIT can produce faster code at runtime
because it knows the code usage patterns and can therefore decide some
code path is faster, based on information available only at runtime.
native code doesn't have that, which can lead to slower code.

That is, in theory. Deciding what's faster eats time too, and the more
time spend on deciding what's faster, the less time is gained.

Another optimization trick for managed code is that it performs code
analysis at compile time like native compilers do, and instead of
optimizing the code in full, it places JIT hints into the IL, so at
runtime the JIT will be faster in deciding what to do, as it knows up
front what lays ahead (because that's already analysed at compile
time). I believe sun's hotspot system uses these.

MS research has said that within a few years time, managed code is
faster than native code. That is to be seen though. A lot is needed in
the JIT department to get everything as fast as it should be.

FB
Miha Markic said:
Actually, managed code can be faster than native code due to CPU
specific optimisations.
Perhaps now there aren't many optiomizations compiler does, however
in future we should se more of them.

--
Miha Markic [MVP C#] - RightHand .NET consulting & development
www.rthand.com
Blog: http://cs.rthand.com/blogs/blog_with_righthand/
SLODUG - Slovene Developer Users Group www.codezone-si.info

Managed Code can never be as fast and as optimized as native code.

Why not?

I have the idea about the opposite.

I make it short otherwise you probably don't understand the
question.


--
------------------------------------------------------------------------
Get LLBLGen Pro, productive O/R mapping for .NET: http://www.llblgen.com
My .NET blog: http://weblogs.asp.net/fbouma
Microsoft MVP (C#)
------------------------------------------------------------------------
 
Back
Top