Fastest way to remove duplicate values from a string array

Guest

Hi all,

What is the fastest way to remove duplicate values from a string array using
VB.NET?

Here is what I am currently doing, but the array contains over 16000 items
and it takes 10 minutes or more.

'REMOVE DUPLICATED VALUES FROM ARRAY +++++++++++++++++
Dim col As New Scripting.Dictionary
Dim ii As Integer = 0
For ii = 0 To DTHESNO_ARRAY.Length - 2
    If Not col.Exists(CStr(DTHESNO_ARRAY(ii))) Then
        col.Add(CStr(DTHESNO_ARRAY(ii)), ii)
    End If
Next

ReDim _DTHESNOKR102A(col.Count - 1)

'ASSIGN NON-DUPLICATED VALUES TO STRING ARRAY ++++++++
Dim iii As Integer = 0
For iii = 0 To col.Count - 1
    _DTHESNOKR102A(iii) = col.Keys(iii)
Next

'NOW SORT THE STRING ARRAY ++++++++++++++++++++++++++
Array.Sort(_DTHESNOKR102A)

col = Nothing


Is it possible to clone the col keys directly into
_DTHESNOKR102A(col.Count - 1)?

Thank you very much for reading my post.

Rgds,
GC
 
Duplicate the array, sort it, then iterate through it to remove doubles.
That should be much faster. Here is some pseudocode:

string[] array = (string[]) original.Clone();
ArrayList words = new ArrayList();
Array.Sort(array);
for (int i = 0; i < array.Length; i++) {
    // after sorting, duplicates sit next to each other
    if (i == 0 || array[i - 1] != array[i]) {
        words.Add(array[i]);
    }
}

Cheers,

Greg Young
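
For anyone who wants the pseudocode above in VB.NET, here is a minimal sketch
of the same sort-then-scan idea. The names SortedDedup and SortedUnique are
made up for illustration, and the sketch assumes that ordinal (case-sensitive)
comparison is what is wanted.

```vbnet
Imports System
Imports System.Collections.Generic

Module SortedDedup
    ' Hypothetical helper: returns the distinct values of source in ordinal order.
    ' The input array is not modified because a copy is sorted.
    Function SortedUnique(ByVal source As String()) As String()
        Dim sorted As String() = CType(source.Clone(), String())
        Array.Sort(Of String)(sorted, StringComparer.Ordinal)

        Dim result As New List(Of String)
        For i As Integer = 0 To sorted.Length - 1
            ' Duplicates are adjacent after sorting, so only the previous item is checked.
            If i = 0 OrElse Not String.Equals(sorted(i - 1), sorted(i), StringComparison.Ordinal) Then
                result.Add(sorted(i))
            End If
        Next
        Return result.ToArray()
    End Function
End Module
```

The result is already sorted, so no separate Array.Sort call is needed
afterwards.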
 
Niyazi said:
What is the fastest way to remove duplicate values from a string array using
VB.NET?

There are two easy ways I can think of for removing duplicates: using an
auxiliary dictionary such as a hash table (which requires extra storage
space), and sorting the array first and then doing a scan which skips
over subsequent identical elements.

If you need the elements in sorted order anyway, or you need to reduce
space requirements, the second will be better.

Niyazi said:
Here is what I am currently doing, but the array contains over 16000 items
and it takes 10 minutes or more.

---8<---
using System;
using System.Collections.Generic;
using System.Diagnostics;

class App
{
    delegate void Method();

    static void Benchmark(string label, int iterations, Method block)
    {
        Stopwatch watch = Stopwatch.StartNew();
        for (int i = 0; i < iterations; ++i)
            block();
        Console.WriteLine("{0,10}: {1:f3} seconds", label,
            watch.ElapsedTicks / (double) Stopwatch.Frequency);
    }

    static T[] MakeUnique<T>(T[] values)
    {
        Dictionary<T,bool> dict = new Dictionary<T,bool>();
        foreach (T value in values)
            if (!dict.ContainsKey(value))
                dict.Add(value, false);
        return new List<T>(dict.Keys).ToArray();
    }

    static void Main()
    {
        Random r = new Random();
        List<int> x = new List<int>();
        for (int i = 0; i < 1000000; ++i)
            x.Add(r.Next(100000));

        int[] xs = x.ToArray();

        Benchmark("Making Unique", 1, delegate
        {
            MakeUnique(xs);
        });
    }
}
--->8---

This C# code (it should be easily convertible to VB.NET), making an
array of 1000000 items unique, takes 0.166 seconds on my machine (2.2
GHz Athlon).

Niyazi said:
Is it possible to clone the col to the _DTHESNOKR102A(col.Count - 1)

You could consider using System.Collections.Hashtable or
System.Collections.Generic.Dictionary<,> instead. It's easy enough to
just do a new List<T>(dict.Keys).ToArray() to grab an array of all the
keys in the dictionary, like I have in my code above.

-- Barry
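
A rough VB.NET conversion of the MakeUnique function above might look like the
following. It is only a sketch: the module name HashDedup is invented here,
and the order of the returned items should not be relied on, so sort the
result afterwards if sorted output is needed.

```vbnet
Imports System
Imports System.Collections.Generic

Module HashDedup
    ' Sketch of a VB.NET equivalent of the C# MakeUnique shown in the post.
    ' Note: a Nothing element would make Dictionary throw, since keys cannot be null.
    Function MakeUnique(Of T)(ByVal values As T()) As T()
        Dim seen As New Dictionary(Of T, Boolean)
        For Each value As T In values
            If Not seen.ContainsKey(value) Then
                seen.Add(value, False)
            End If
        Next

        ' Copy all keys out in one call instead of indexing them one by one.
        Dim result(seen.Count - 1) As T
        seen.Keys.CopyTo(result, 0)
        Return result
    End Function
End Module
```

Usage would be along the lines of
Dim unique As String() = HashDedup.MakeUnique(DTHESNO_ARRAY), followed by
Array.Sort(unique) if the sorted order matters.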
 
Hi everyone,

To Greg:
I will test the approach you showed on Monday. I am hoping that it will be
much faster than what I am doing at the moment. Thank you for your input.

To Cor:
I couldn't use the pseudocode that you gave me before. Strangely, I realized
that the way I did it works but consumes more time when I try to remove
duplicate values from a specific column. The reason I couldn't use your
pseudocode is that I didn't have enough time to restructure the Sub. Earlier I
gave you an example with 4 columns, but I am working with 25 columns, more
than 10 of which hold 255 characters, and each specific block belongs to
another table. For example, one column holds 48 characters: the first 32
characters are empty, the 10 characters from the 33rd onwards are the
customer's national ID card number, and the last 6 are the customer ID. So
they decided to clean the data step by step. I received 38399 rows from the
AS400, and after removing the duplicate values from the HESNO column I run
over the array as shown below:

Dim i As Integer = 0
For i = 0 To _DTHESNOKR102A.Length - 1
    Dim mxDTHESNO As String = ""
    mxDTHESNO = _DTHESNOKR102A(i)

    Dim xcEXPR1 As String = "WWWW = " & mxDTHESNO
    Dim xcSORT1 As String = "WWWW ASC"
    Dim resROWS As DataRow()
    resROWS = Nothing
    resROWS = tmpTABLE.Select(xcEXPR1, xcSORT1)

    'FOR DEBUG ONLY
    'Dim nLEN As Integer = tmpTABLE.Rows.Count

    Dim myBALANCE As Decimal = 0.0
    If resROWS.Length > 1 Then
        Dim mLen As Integer = resROWS.Length - 1
        Dim yumax As Integer = 0
        For yumax = 0 To mLen
            myBALANCE = myBALANCE + CDec(resROWS(yumax).ItemArray(16))
        Next

        Dim trimax As Integer = 0
        For trimax = 0 To mLen
            Dim tmpDTMHNO As String = ""
            tmpDTMHNO = resROWS(trimax).ItemArray(20)
            tmpDTMHNO = tmpDTMHNO.Remove(2, 8)

            If tmpDTMHNO = "09" Then
                'DELETE or REMOVE OPERATION ++++++++++++++++++++++++++++
                resROWS(trimax).Delete()
            Else
                'REPLACE OPERATION +++++++++++++++++++++++++++++++++++++
                resROWS(trimax).Item(16) = CStr(myBALANCE)
            End If
        Next
    End If
Next

'SORT THE TEMP TABLE
Dim mEXPR1 As String = "QQQQ <> 0"
Dim mSORT1 As String = ""
mSORT1 = "UUUU ASC"

tmpResultROWS = Nothing
tmpResultROWS = tmpTABLE.Select(mEXPR1, mSORT1)

The result comes to 4303 rows, and the balance matches what they had been
working out manually in Excel for months.

I could leave it as it is, but it bothers me and I have to find the best way
to make it run much faster. I also have to deal with office politics. My boss
is one of the Assistant General Managers, and he used the old COBOL language
to program the AS400, and I get the feeling that he doesn't want me to
succeed. We are moving to a new accounting system, and he doesn't want that
system because he cannot program in C++, C or Java. I have earned 2 C++
certifications in the last 7 years, but unfortunately I couldn't find any
place to show my skills. It is very important for me to get the new project.
My main goal in getting the new project is that I can lift my a.. from VB.NET
and do C++, Java or C# programming again. For the last 2 years I have had to
deal with office politics every single minute. I got this job because I had a
C++ programming certification from Minnesota University. But for the last 2
years they have forced me to program in VB.NET, and I have also made some TV
commercial material using Macromedia Flash to create QuickTime movies.

I really appreciate all your help. But what I want to ask you about is a funny
thing that I couldn't see before. How come the delete operation on the DataRow
collection removes the rows from the actual DataTable? I am sorry, I couldn't
even find any article to read.

As you can see:
Dim xcEXPR1 As String = "WWWW = " & mxDTHESNO
Dim xcSORT1 As String = "WWWW ASC"
Dim resROWS As DataRow()
resROWS = Nothing
resROWS = tmpTABLE.Select(xcEXPR1, xcSORT1)

If the resROWS length is more than one, then I have to work with that data.
Once I delete the rows, they are also removed from the DataTable, even though
I never called the DataTable to remove them, and I guess I unknowingly did the
right thing while I was testing.

I didn't expect:
If tmpDTMHNO = "09" Then
    'DELETE or REMOVE OPERATION ++++++++++++++++++++++++++++
    resROWS(trimax).Delete()
Else
    'REPLACE OPERATION +++++++++++++++++++++++++++++++++++++
    resROWS(trimax).Item(16) = CStr(myBALANCE)
End If

the lines above to affect the DataTable directly, but they did. This was luck
on my side, but I really want to learn how a DataTable works with a DataRow
collection. Something tells me that the DataRow collection actually belongs to
the DataTable and is not a separate collection of DataRows.

This section does the whole job in 5 to 6 minutes, which seems very fast. But
removing the duplicate values from the array takes time. Since I already get
the result I wanted, I want to learn the fastest way to remove duplicate
values from a string array. That's the reason I started a new thread. I will
check the link this weekend and read it. I thank you all for your help. I
really appreciate it.

To Barry:
I will test the code this weekend and post the result on Monday. Thank you
too for your kind help.

I thank all of you for trying to help me in my difficult situation.

Rgds,
Niyazi
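
On the question of why deleting from the array returned by Select changes the
table: DataTable.Select returns references to the table's own DataRow objects,
not copies, so Delete and Item assignments act on the table itself. The sketch
below illustrates this with a small made-up table. The module, function and
column names are hypothetical and only stand in for the real ones.

```vbnet
Imports System
Imports System.Data

Module SelectDemo
    ' Builds a tiny table, deletes one row through the array returned by Select,
    ' and reports how many matching rows the table itself still returns.
    Function RemainingAfterDelete() As Integer
        Dim table As New DataTable("demo")
        table.Columns.Add("HESNO", GetType(String))
        table.Columns.Add("BALANCE", GetType(Decimal))
        table.Rows.Add("A1", 10D)
        table.Rows.Add("A1", 5D)
        table.Rows.Add("B2", 7D)

        ' matches holds references to rows owned by table.
        Dim matches As DataRow() = table.Select("HESNO = 'A1'")
        matches(0).Delete()

        ' A fresh Select no longer returns the deleted row.
        Return table.Select("HESNO = 'A1'").Length
    End Function

    ' Every row handed back by Select still reports the original table as its owner.
    Function RowsBelongToTable() As Boolean
        Dim table As New DataTable("demo")
        table.Columns.Add("HESNO", GetType(String))
        table.Rows.Add("A1")
        Return table.Select("HESNO = 'A1'")(0).Table Is table
    End Function
End Module
```

Whether a deleted row disappears from Rows immediately or is only marked as
deleted until AcceptChanges is called depends on its RowState at the time.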
 
Niyazi,

It still does not look at our sample.

We have some problems with the web server, so you cannot see the sample right
now.

Both my pseudocode sample and my sample on the website are based on building
a new DataTable (cloned or not) and adding new totalized rows to it.

If you do not do that, you have to remove/delete a lot of rows, and
removing/deleting rows is exactly the weak point of a DataTable, although it
is better in version 2005 than before.

Cor
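
Cor's own sample is not visible in this thread, so the following is only a
guess at the general shape of "build a new table and add totalized rows". It
is a sketch under assumptions: the names Totalize, BuildTotals, keyColumn and
amountColumn are hypothetical, it keeps the first row seen for each key, and
it leaves out the "09" rule from the loop posted earlier.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Data

Module Totalize
    ' Hypothetical helper: one pass over source, one output row per distinct key,
    ' with amountColumn summed. Nothing is deleted from the source table.
    ' Assumes neither column contains DBNull.
    Function BuildTotals(ByVal source As DataTable, ByVal keyColumn As String, _
                         ByVal amountColumn As String) As DataTable
        Dim totals As DataTable = source.Clone()   ' same schema, no rows
        Dim byKey As New Dictionary(Of String, DataRow)

        For Each row As DataRow In source.Rows
            Dim key As String = CStr(row(keyColumn))
            Dim existing As DataRow = Nothing
            If byKey.TryGetValue(key, existing) Then
                ' Key seen before: add this row's amount to the running total.
                existing(amountColumn) = CDec(existing(amountColumn)) + CDec(row(amountColumn))
            Else
                ' First row for this key: copy it into the new table.
                byKey.Add(key, totals.Rows.Add(row.ItemArray))
            End If
        Next
        Return totals
    End Function
End Module
```

Because the dictionary lookup replaces one Select call per key, the source
table is read once and no rows have to be removed from it.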
 
Hi Cor,
I couldn't access the page, and I couldn't access the site at all.
I will wait and check again some time this week.

I know my approach has a weak point: removing or deleting rows in the DataRow
collection has a direct effect on the DataTable. But that is what I have
found to do for now.

To Greg and Barry:
I actually didn't change my code and am still using Scripting.Dictionary.
But what I did is this: instead of putting each of the col.Keys into the
array one by one in a For loop, I just copy the col keys into an ArrayList,
and it saved a lot of time for me.

After removing the duplicate values from the array, I just added another line
as shown below:
'NOW INSERT THE col.Keys INTO _DTHESNOKR102A ARRAYLIST
_DTHESNOKR102A.AddRange(col.Keys)
col = Nothing

Up to this line, the previous code took about 15 minutes, and now it takes
less than 1 minute. So I thank you both for your kind help, which gave me the
idea to achieve this in a much better way.

But I still wish to see Cor's pseudocode example.

I thank all of you for your kind help.

Rgds,
Niyazi
 