Removing duplicate records

  • Thread starter Thread starter Peter
  • Start date Start date
P

Peter

I have the following code which reads XML files and merges them in to one
dataset. Some of the files have duplicate data, my question is what is the
best way to remove the duplicate records after the records are merged or how
to prevent them from merging in a first place?


DataSet ds = new DataSet();
DataSet dsXml = new DataSet();

try
{
string [] files = Directory.GetFiles(path,
"VisitTable_*.xml");
foreach(string file in files)
{
dsXml.ReadXml(file, System.Data.XmlReadMode.Auto);
ds.Merge(dsXml.Tables[0]);
}
}
catch(Exception e)
{
throw e;
}


Peter
 
Hi Peter,

Thank you for posting in the community!

First of all, I would like to confirm my understanding of your issue. From
your description, I understand that you need to merge the DataSets from the
Xml files into one and you need to remove the duplicate rows. If there is
any misunderstanding, please feel free to let me know.

As far as I know, to remove the duplicate records, we need add a primary
key to the table. The primary key is for the table to identify a row and to
decide whether two records are the same. When merging two DataSets with
primary key, the method will automatically overwrite the old record with
new fetched one. So that only one copy of that record will be added.

If there is no primary key set in the table, I think we have to write more
code to achieve this. We can use the Select method to decide whether the
records exists in the current DataSet. Here I have made some changes to
your code.

DataSet ds = new DataSet();
DataSet dsXml = new DataSet();

try
{
string [] files = Directory.GetFiles(path,"VisitTable_*.xml");
foreach(string file in files)
{
dsXml.ReadXml(file, System.Data.XmlReadMode.Auto);
foreach(DataRow dr in dsXml.Tables[0].Rows)
{
if(ds.Tables[0].Select("Filter expression here to decide whether the
records exists.").Length == 0)
ds.Tables[0].Rows.Add(dr.ItemArray);
}
}
}
catch(Exception e)
{
throw e;
}

HTH. If anything is unclear, please feel free to reply to the post.

Kevin Yu
=======
"This posting is provided "AS IS" with no warranties, and confers no
rights."
 
What is the syntax for the filter?
I want to select only one record

ds.Merge(dsXml.Tables[0].Select("Only first record"));

Thanks
 
Peter:


Here's a routine I wrote to Find duplicates in a DataTable. Once they are
found, it's easy to just walk through them and delete either the first or
second/third/forth etc instance you find:

Private Sub ShowDuplicates()
Me.Cursor = Cursors.WaitCursor

Try
If Not optDuplicates.Checked Or AllData.Tbl_Job_Tracking Is Nothing
Then Exit Sub
Dim dr As DataRow
dg.CaptionText = "Processing...."
Application.DoEvents()

Dim CompareVal As String = String.Empty
Dim BaseValue As String = String.Empty
Duplicates = New ArrayList
For Each dr In AllData.Tbl_Job_Tracking.Select("", FieldName)
If IsDBNull(dr(FieldName)) Then
BaseValue = String.Empty
Else
BaseValue = CType(dr(FieldName), String)
End If
If BaseValue = CompareVal Then
Duplicates.Add(CompareVal)
Debug.WriteLine(CompareVal)
End If
CompareVal = BaseValue
Next
Catch ex As InvalidCastException
Debug.Assert(False, ex.ToString)
End Try
ShowOnlyDuplicates(Duplicates)
lblTotalLines.Text = GridSource.Count.ToString
CalculateFilteredLines()
dg.CaptionText = "Finished"
dg.CaptionText = CType(lstDepartments.SelectedItem,
DataRowView).Item(1).ToString & " [Duplicate Records Only]"
Me.Cursor = Cursors.Default
End Sub

Public Sub ShowOnlyDuplicates(ByVal Dups As ArrayList)
If Dups.Count = 0 Then
MessageBox.Show("There doesn't appear to be any duplicate records",
"No Duplicates Found", MessageBoxButtons.OK, MessageBoxIcon.Information)
Exit Sub
End If
Dim sb As New System.Text.StringBuilder
sb.Append(FieldName & " IN( ")

For i As Integer = 0 To Dups.Count - 1
If i = 0 Then
sb.Append("'" & CType(Dups(i), String) & "'")
Else
sb.Append(", '" & CType(Dups(i), String) & "'")
End If
Next
sb.Append(")")

GridSource.RowFilter = sb.ToString
GridSource.Sort = "Excel_Job_Number"
End Sub

HTH,

Bill
 
Hi Peter,

The syntax for the filter is just like the criteria in SQL statement after
'WHERE' clause. Here is an example.

dsXml.Tables[0].Select("Score>80 and Gender='Male'");

If anything is unclear, please feel free to reply to the post.

Kevin Yu
=======
"This posting is provided "AS IS" with no warranties, and confers no
rights."
 
So if I have the following record set how do I select first record?

Score Gender
80 Male
80 Male
80 Male

Thanks
 
Hi Peter,

I think you didn't catch my meaning. What I mean in my first post is that
we can add rows one by one. When adding each row, we use DataTable.Select
method to decide if the row has been added yet. If the row has been added,
we skip that record. So, there will not be any duplicated rows at all the
time adding records. I think the case you have mentioned will not occur
with my suggestion. HTH.

If anything is unclear, please feel free to reply to the post.

Kevin Yu
=======
"This posting is provided "AS IS" with no warranties, and confers no
rights."
 
Hi Peter,

I'd like to know if this issue has been resolved yet. Is there anything
that I can help. I'm still monitoring on it. If you have any questions,
please feel free to post them in the community.

Kevin Yu
=======
"This posting is provided "AS IS" with no warranties, and confers no
rights."
 
Back
Top