Major Performance Bug in DataTable?

  • Thread starter Thread starter Brian Gideon
  • Start date Start date
B

Brian Gideon

I came across this disturbing problem. The code adds 10000 rows to a
DataTable and then changes the value of one field on each row. It
should be super fast right? As written it takes 19s to complete on my
machine.

class Program
{
static void Main(string[] args)
{
DataTable table = new DataTable();
table.Columns.Add("Value", typeof(Object));
table.Columns[0].AllowDBNull = false;

table.BeginLoadData();
for (int i = 0; i < 10000; i++)
{
table.Rows.Add(i);
}
table.EndLoadData();

Stopwatch sw = new Stopwatch();
sw.Start();
foreach (DataRow row in table.Rows)
{
row["Value"] = 3.14;
}
sw.Stop();
Console.WriteLine(sw.Elapsed.TotalSeconds);
Console.ReadLine();
}
}

The reason why the above code is so slow is because the row["Value"] =
3.14 line is generating an exception that is being swallowed.

Now, if you make either of the following changes it will take less
than 1/100th second to complete. You only do one of the 3 to see a
difference.

1) Comment out table.Columns[0].AllowDBNull = false
2) Comment out table.BeginLoadData and table.EndLoadData
3) Change row["Value"] = 3.14 to row["Value"] = 3

Can someone else confirm this bug? I'm using .NET 2.0.
 
Why do you not strongly type the field to accept a floating point? That
would like work much better than the boxing you are doing to use an object
type. You will probably find the error going away with strong typing, as
well.

--
Gregory A. Beamer
MVP, MCP: +I, SE, SD, DBA

*************************************************
| Think outside the box!
|
*************************************************
 
Doing

row["Value"]

repeatedly is a performance hit. Get the column ordinal and use that
instead. It will speed up your process.

RobinS.
GoldMail, Inc.
 
Why do you not strongly type the field to accept a floating point? That
would like work much better than the boxing you are doing to use an object
type. You will probably find the error going away with strong typing, as
well.

The field comes from a special kind of non-relational database that
can be many different types. But, yes, changing it to a strongly
typed field speeds it up again. But then again so does commenting out
BeginLoadData and EndDataLoad. That doesn't make any sense does it?
Think about it this way. Do you think any of those 3 changes I
mentioned should change the performance by an order of 10^4 (or more)
for a DataTable that only holds 10,000 rows?
 
Doing

   row["Value"]

repeatedly is a performance hit. Get the column ordinal and use that
instead. It will speed up your process.


So referencing a column by name is at least an order of 10^4 slower
than referencing by ordinal? Why then does making any one of the 3
changes I listed speed it up so much?
 
Brian Gideon said:
   row["Value"]

repeatedly is a performance hit. Get the column ordinal and use that
instead. It will speed up your process.

So referencing a column by name is at least an order of 10^4 slower
than referencing by ordinal? Why then does making any one of the 3
changes I listed speed it up so much?

Getting the column doesn't make any significant difference here.

The problem is that you're changing the type of the value stored. I
haven't looked through the problem in its entirety, but the basic
message is "don't do that". Work out the type of the column to start
with, and make sure you only store that type of data in there. Changing
the data type (when you're storing primitive value types) will be
expensive.
 
Getting the column doesn't make any significant difference here.
Exactly.


The problem is that you're changing the type of the value stored. I
haven't looked through the problem in its entirety, but the basic
message is "don't do that". Work out the type of the column to start
with, and make sure you only store that type of data in there.

Changing types is definitely related to the problem. But, that's not
the whole story. I can still change the data type during a call to
the row["Value"] setter and as long as I comment out the BeginLoadData
and EndLoadData from the first loop the second loop will be at least
10^4 times faster. Likewise, setting AllowDBNull = true speeds up
that row["Value"] setter call...again by a factor of at least 10^4.
Changing
the data type (when you're storing primitive value types) will be
expensive.

By a factor of at least 10^4? No way. The difference based on a
10,000 iteration loop should be so imperceptable that even the high
precision Stopwatch class wouldn't pick it up.

Because I've profiled the code I can see that the row["Value"] setter
is internally generating and swallowing an exception. I can even see
what part of the code it's occurring in using the Reflector. Sure
enough, Microsoft has code embedded way down deep that swallows
exceptions from that call in some cases. And again, by making
seemingly unrelated calls to the DataTable before the loop somehow
causes the problem to start occurring.

Running the posted code in the debugger is even worse. In fact, I
have no idea how long it takes because I'm too impatient to wait. I'm
assuming that's because the debugger is capturing that first chance
exception and trying to decide whether or not to halt based on the
rules you define in the Debug | Exceptions menu option.
 
Brian Gideon said:
The problem is that you're changing the type of the value stored. I
haven't looked through the problem in its entirety, but the basic
message is "don't do that". Work out the type of the column to start
with, and make sure you only store that type of data in there.

Changing types is definitely related to the problem. But, that's not
the whole story. I can still change the data type during a call to
the row["Value"] setter and as long as I comment out the BeginLoadData
and EndLoadData from the first loop the second loop will be at least
10^4 times faster. Likewise, setting AllowDBNull = true speeds up
that row["Value"] setter call...again by a factor of at least 10^4.

Sure - so there's definitely something interesting going on.
Changing
the data type (when you're storing primitive value types) will be
expensive.

By a factor of at least 10^4? No way. The difference based on a
10,000 iteration loop should be so imperceptable that even the high
precision Stopwatch class wouldn't pick it up.

Because I've profiled the code I can see that the row["Value"] setter
is internally generating and swallowing an exception. I can even see
what part of the code it's occurring in using the Reflector. Sure
enough, Microsoft has code embedded way down deep that swallows
exceptions from that call in some cases. And again, by making
seemingly unrelated calls to the DataTable before the loop somehow
causes the problem to start occurring.

Swallowing 10,000 exceptions certainly wouldn't take 10 seconds. It's
possible that it's swallowing millions of exceptions, but not just
10,000.

I wonder whether it's trying to do something nasty in terms of hashing
or internal sorting - although I'd expect that to happen if you *don't*
have BeginLoadData, not if you do...
Running the posted code in the debugger is even worse. In fact, I
have no idea how long it takes because I'm too impatient to wait. I'm
assuming that's because the debugger is capturing that first chance
exception and trying to decide whether or not to halt based on the
rules you define in the Debug | Exceptions menu option.

Certainly exceptions are significantly slower in the debugger.
 
Because I've profiled the code I can see that the row["Value"] setter
is internally generating and swallowing an exception.  I can even see
what part of the code it's occurring in using the Reflector.  Sure
enough, Microsoft has code embedded way down deep that swallows
exceptions from that call in some cases.  And again, by making
seemingly unrelated calls to the DataTable before the loop somehow
causes the problem to start occurring.

It looks like the exception is being generated and swallowed in
System.Data.Common.ObjectStorage.Compare. Using Reflector I can see
that there is a try-catch that doesn't rethrow. The posted code
generates and swallows a mind numbing 313,627 exceptions! No wonder
it's so slow.

I did some testing where I change the number of rows in the DataTable
and it appears the number of exceptions being generating is
proportional to n*log(n). I can see some activity happening inside of
a red-black tree implementation of an index which might explain the
n*log(n) ratio.

I can't imagine how this is by design. It must be a bug.
 
Any time you have variant data, you run a risk of both errors and
performance hits. I am not sure 10^4 is an expected difference, but error
handling is expensive. One possibility, to reduce errors, would be to run
the value through some sort of filter to extract as type before you box. I
am not sure if you have the metadata in your table to easily accomplish
this, however.

It very well may be a bug in the underlying framework, but bugs in fringe
conditions are not as readily caught, or are they fixed as fast. Most people
using the framework are not dealing with variants. This does not mean the
bug should not be examined and fixed, if it is a bug, but rather that it
does not surprise me there was a coding issue in a fringe condition.

--
Gregory A. Beamer
MVP, MCP: +I, SE, SD, DBA

*************************************************
| Think outside the box!
|
*************************************************
Why do you not strongly type the field to accept a floating point? That
would like work much better than the boxing you are doing to use an object
type. You will probably find the error going away with strong typing, as
well.

The field comes from a special kind of non-relational database that
can be many different types. But, yes, changing it to a strongly
typed field speeds it up again. But then again so does commenting out
BeginLoadData and EndDataLoad. That doesn't make any sense does it?
Think about it this way. Do you think any of those 3 changes I
mentioned should change the performance by an order of 10^4 (or more)
for a DataTable that only holds 10,000 rows?
 
Bug or missing feature. In this case, the test is fringe enough, I believe
it was probably just missed. I do agree that it should be looked at and some
form of answer should be culled out of the data. You should log it on
connect, with all data you have. That is more likely to get a fix than
posting here, which is more likely to get discussion, debate and perhaps
argument. :-)

If you find others who are dealing with this issue, have them add to your
Connect report. In these types of issues, the squeeky wheel is more likely
to be greased.

--
Gregory A. Beamer
MVP, MCP: +I, SE, SD, DBA

*************************************************
| Think outside the box!
|
*************************************************
Because I've profiled the code I can see that the row["Value"] setter
is internally generating and swallowing an exception. I can even see
what part of the code it's occurring in using the Reflector. Sure
enough, Microsoft has code embedded way down deep that swallows
exceptions from that call in some cases. And again, by making
seemingly unrelated calls to the DataTable before the loop somehow
causes the problem to start occurring.

It looks like the exception is being generated and swallowed in
System.Data.Common.ObjectStorage.Compare. Using Reflector I can see
that there is a try-catch that doesn't rethrow. The posted code
generates and swallows a mind numbing 313,627 exceptions! No wonder
it's so slow.

I did some testing where I change the number of rows in the DataTable
and it appears the number of exceptions being generating is
proportional to n*log(n). I can see some activity happening inside of
a red-black tree implementation of an index which might explain the
n*log(n) ratio.

I can't imagine how this is by design. It must be a bug.
 
I can't imagine how this is by design. It must be a bug.

Well, it depends. It might be a "known non-optimal situation" - but it
could well be that the code which makes this situation behave poorly is
the same code which makes more "normal" situations behave really well,
and that code which behaves well in both cases is significantly more
complicated. More complicated code is more likely to have bugs, of
course... I don't mind too much having code which performs badly in
some rare cases, performs well in more common cases, but *works* both
ways.

I'm not saying that necessarily *is* the trade-off they've made, but
it's a possible one.

It wouldn't hurt to report it on connect.microsoft.com.
 
Well, it depends. It might be a "known non-optimal situation" - but it
could well be that the code which makes this situation behave poorly is
the same code which makes more "normal" situations behave really well,
and that code which behaves well in both cases is significantly more
complicated. More complicated code is more likely to have bugs, of
course... I don't mind too much having code which performs badly in
some rare cases, performs well in more common cases, but *works* both
ways.

I'm not saying that necessarily *is* the trade-off they've made, but
it's a possible one.

It wouldn't hurt to report it on connect.microsoft.com.

I'll go ahead and report it. In the meantime the workaround is
simple. Not calling BeginLoadData and EndLoadData on the original
DataTable is not possible since it comes from code I didn't write it
(in fact the posted code is nothing like what I'm actually working
with, it just happened to be the shortest complete code snippet that
demonstrates the problem). But, I can copy the entire contents of the
original DataTable into a new instance with the same schema, but not
call BeginLoadData and EndLoadData when I do it. It works and the
additional O(n) operation will be insignificant in the grand scheme of
things especially considering the alternative.

By the way, I'm a little concerned that this demonstrates that the
DataRow[column] setter is a O(n*log(n)) implementation on a non-
indexed field...at least in this scenario. I'm pretty sure the
complexity is O(1) when I remove the BeginLoadData and EndLoadData
calls from the previous loop, but I've already spent too much time on
this and I don't care to spend more considering I have an acceptable
workaround.

Thanks to everyone that replied!
 
Brian Gideon said:
I'll go ahead and report it. In the meantime the workaround is
simple. Not calling BeginLoadData and EndLoadData on the original
DataTable is not possible since it comes from code I didn't write it
(in fact the posted code is nothing like what I'm actually working
with, it just happened to be the shortest complete code snippet that
demonstrates the problem).

And believe me, I thank you from the bottom of my heart for that :)
But, I can copy the entire contents of the
original DataTable into a new instance with the same schema, but not
call BeginLoadData and EndLoadData when I do it. It works and the
additional O(n) operation will be insignificant in the grand scheme of
things especially considering the alternative.

An alternative suggestion which would avoid the doubling up of memory
would be to convert all the existing data from one type to another
(without changing it to the new data) outside a BeginLoadData and
EndLoadData - then the actual change to the data could occur with in a
BeginLoadData/EndLoadData.
By the way, I'm a little concerned that this demonstrates that the
DataRow[column] setter is a O(n*log(n)) implementation on a non-
indexed field...at least in this scenario. I'm pretty sure the
complexity is O(1) when I remove the BeginLoadData and EndLoadData
calls from the previous loop, but I've already spent too much time on
this and I don't care to spend more considering I have an acceptable
workaround.

Of course, now that the framework source code is available (I assume
System.Data is included) you could debug through it. I'd do it myself,
but I'm in the unusual situation of not being able to agree to a
licence for a feature that I've asked for for a long time :)
 
Back
Top