Optimized Merge Algorithm for ETL-Process?

Frank Hauptlorenz

Hi out there,

I've developed an ETL-like application which fetches data from our SAP
system incrementally and merges it into an MSSQL database.
For this merge process I can't use the DataTable.Merge method, because
that method simply uses the DataRowState to merge.
So I developed my own method, which compares the content and then
adds, modifies or deletes the rows.

Because this is not very fast: does an optimized method for this
already exist?

Thank you very much,
Frank
 
Hello Frank

Since we are comparing the performance of DataTable.Merge against our own
implementation, I looked into the code of DataTable.Merge with .NET
Reflector (http://www.aisto.com/roeder/dotnet/).

In the implementation of DataTable.Merge, I see this code that merges the
src table into the dst table:

foreach (DataRow row2 in src.Rows)
{
    DataRow targetRow = null;
    if (ndx != null)
    {
        targetRow = dst.FindMergeTarget(row2, srcKey, ndx);
    }
    dst.MergeRow(row2, targetRow, this.preserveChanges, ndx);
}

It enumerates the source table's rows and finds the target row with
"dst.FindMergeTarget(row2, srcKey, ndx);" according to the primary key. In
the MergeRow function, the code updates the rows based on the DataRowState
as you said. Here is part of MergeRow's code:

if ((targetRow.RowState == DataRowState.Unchanged) &&
    (row.RowState == DataRowState.Unchanged))
{
    oldRecord = targetRow.oldRecord;
    newRecord = preserveChanges
        ? this.recordManager.CopyRecord(this, oldRecord, -1)
        : targetRow.newRecord;
    oldRecord = this.recordManager.CopyRecord(row.Table, row.oldRecord,
        targetRow.oldRecord);
    this.SetMergeRecords(targetRow, newRecord, oldRecord,
        DataRowAction.Change);
}
else if (row.newRecord == -1)
{
    oldRecord = targetRow.oldRecord;
    .......

I think what we need to do is rewrite this part of the code in our own
library to do a real comparison of the column values instead of relying on
the RowState. Comparing the column values is definitely slower than
reading the RowState, so this modification will cost some
performance.
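
For illustration, here is a minimal sketch of such a value-based
"update/add" pass. The names src, dst and pkColumn are placeholders, the
tables are assumed to share the same schema, dst must have its PrimaryKey
set so that Rows.Find works, and RowValuesEqual is our own helper, not a
framework method:

using System.Data;

// Hypothetical helper: returns true when two rows hold the same value in
// every column (both tables are assumed to share the same schema).
static bool RowValuesEqual(DataRow sourceRow, DataRow targetRow)
{
    foreach (DataColumn column in sourceRow.Table.Columns)
    {
        // object.Equals also handles DBNull and boxed value types.
        if (!object.Equals(sourceRow[column], targetRow[column.ColumnName]))
        {
            return false;
        }
    }
    return true;
}

// Update/add pass: find the target row by primary key and only copy the
// values over when they actually differ.
static void UpdateAndAdd(DataTable src, DataTable dst, string pkColumn)
{
    foreach (DataRow sourceRow in src.Rows)
    {
        DataRow targetRow = dst.Rows.Find(sourceRow[pkColumn]);
        if (targetRow == null)
        {
            dst.Rows.Add(sourceRow.ItemArray);          // new row -> RowState Added
        }
        else if (!RowValuesEqual(sourceRow, targetRow))
        {
            targetRow.ItemArray = sourceRow.ItemArray;  // changed row -> RowState Modified
        }
    }
}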

To delete the rows, we iterate backwards through the target datatable (i.e.
dt1 in dt1.Merge(dt2)) and do a row Find against the source datatable (i.e.
dt2 in dt1.Merge(dt2)). If no match is found, we call Delete() on the row,
which sets its RowState to Deleted.
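
A sketch of that delete pass, again with dt1/dt2 and pkColumn as
placeholders, assuming dt2 has its PrimaryKey set and that dt1 holds no
rows that are already in the Deleted state:

// Delete pass: walk the target table backwards and mark every row that
// has no primary-key match in the source table as Deleted.
static void MarkDeletedRows(DataTable dt1, DataTable dt2, string pkColumn)
{
    for (int i = dt1.Rows.Count - 1; i >= 0; i--)
    {
        DataRow targetRow = dt1.Rows[i];
        if (dt2.Rows.Find(targetRow[pkColumn]) == null)
        {
            targetRow.Delete();   // only changes RowState; the row is removed on AcceptChanges()
        }
    }
}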

In conclusion, the above solution costs performance in two aspects:
1. the additional comparison of the column values for each row in the
source datatable;
2. the additional Find (based on the primary key) for each row in the
target datatable.

To determine which of the above has the larger influence on performance,
you may use a .NET profiler
(http://search.live.com/results.aspx?q=.NET+profiler&form=QBLH), or simply
measure the code execution time as described in
http://msmvps.com/blogs/omar/archiv...ng-_2600_quot_3B00_using_2600_quot_3B00_.aspx.
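
For the timing approach, a minimal sketch with System.Diagnostics.Stopwatch
(CustomMerge is just a placeholder for whatever merge routine you want to
measure):

var stopwatch = System.Diagnostics.Stopwatch.StartNew();

CustomMerge(dt1, dt2);   // placeholder for the merge routine being measured

stopwatch.Stop();
System.Console.WriteLine("Merge took {0} ms", stopwatch.ElapsedMilliseconds);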

In addition, I found a relevant discussion of this issue in the newsgroup:
http://www.devnewsgroups.net/group/microsoft.public.dotnet.framework.adonet/topic38348.aspx

What does your current method for merging the datatables look like? Is my
analysis helpful to you?

Regards,
Jialiang Ge ([email protected], remove 'online.')
Microsoft Online Community Support

 
Hi Jialiang,

I'm using an algorithm which seems to be very close to yours: first I
search for the rows that should be deleted.
Then I check each row to see whether it is new (insert) or modified
(comparing each value).
For modified rows, I change the value of each field individually.

I'm now testing a new algorithm which just performs the comparison and
sets the RowStates, and then uses the original DataTable.Merge. Let's see
how it develops.
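
Roughly, the idea looks like this (a sketch only: dt1 is the target, dt2
the incoming data, pkColumn a placeholder; dt1 needs its PrimaryKey set,
and all rows in dt2 are assumed to start out Unchanged so that
SetAdded/SetModified do not throw; the delete pass still has to run
separately, since Merge does not remove rows missing from the source):

foreach (DataRow sourceRow in dt2.Rows)
{
    DataRow targetRow = dt1.Rows.Find(sourceRow[pkColumn]);
    if (targetRow == null)
    {
        sourceRow.SetAdded();        // no match in the target -> treat as insert
        continue;
    }

    // Mark the source row Modified on the first differing column value.
    foreach (DataColumn column in dt2.Columns)
    {
        if (!object.Equals(sourceRow[column], targetRow[column.ColumnName]))
        {
            sourceRow.SetModified();
            break;
        }
    }
}

dt1.Merge(dt2);   // Merge now applies the RowStates set above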

I will put some info here later.

Thank you,
Frank
 
Hi Jialiang,

I've now tested my new algorithm, but it seems to be a little slower
than my first one.
Maybe that one *was* the perfect one :-)

Thank you,
Frank
 
Hello Frank

I think the bottleneck in our merge algorithm is the process of comparing
every value between the source and target tables, but we may be able to
improve the performance in two other aspects:

1. Assuming we do the "update/add" pass first, we can restrict the
"delete" pass to the rows that are still Unchanged in the target table. We
do not need to check rows marked as Modified or Added in the "delete"
phase, because they definitely appear in the source table (see the sketch
after this list).

2. The order of "update/add" and "delete".
I would still suggest doing "update/add" first and then "delete", because
in the "delete" phase a large number of rows can be filtered out according
to point 1. Doing "delete" first cannot benefit from that optimization.
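
A sketch of point 1 applied to the delete pass (dt1 is the target, dt2 the
source, pkColumn a placeholder; dt2 needs its PrimaryKey set for
Rows.Find):

// Delete pass after the update/add pass: rows we touched are already
// Modified or Added, so only rows still Unchanged can need deletion.
foreach (DataRow targetRow in dt1.Rows)
{
    if (targetRow.RowState != DataRowState.Unchanged)
    {
        continue;                                    // skip Modified/Added (and Deleted) rows
    }

    if (dt2.Rows.Find(targetRow[pkColumn]) == null)
    {
        targetRow.Delete();                          // no match in the source -> mark Deleted
    }
}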

Hope it helps

Have a nice day!

Regards,
Jialiang Ge ([email protected], remove 'online.')
Microsoft Online Community Support

 