EXCLUDING DUPES IN STRING ARRAY !!!!

  • Thread starter Thread starter jay dean
  • Start date Start date
J

jay dean

Hi -

B is a string var. In my code, if a certain condition is met, then store
B in th next available index of Arr(). However, before I store B, I need
to check that the current value of B does not already exist in Arr().

Is there a "faster" way to accomplish this, or I need to loop from
lbound(Arr) to Ubound(Arr) every time to check if the new value to be
stored already exists?

Thanks
Jay Dean
 
Considering all else (the array will be used somewhere, so the check for
duplicates is not the only deciding factor)
I doubt there are better ways than just looping through the array, but a few
things to consider:
1. Is the string array sorted? If it is then you could check with a binary
search. That will be a lot faster than a full loop.
2. Could you use a collection or dictionary instead of the array? With that
the check for duplicates might be faster.
3. You could have the array in a string variable, eg: element1 & | element2
& | etc. With that you could than check with
Instr. Concatenating the strings though will be a big overhead, so I doubt
it will help.
4. You could use Olaf Schmidt's dhRichClient3.dll. That has a very fast
collection object and dictionary object.
5. You could invest in Jim Mach's Stamina dll. That has some fast array
routines that could speed this up.
Can't think of much else.

RBS
 
You could maintain what I would call a "check string" for this purpose.
Let's say the name of this String variable is CheckString. Then you can do
this in a loop...

For X = 1 To SomethingLessThanInfinity
'
' Some kind of conditioning code goes here I presume
'
If YourCondition Then
If Instr(CheckString, Chr(1) & B & Chr(1)) = 0 Then
Arr(X) = B
CheckString = CheckString & Chr(1) & B & Chr(1)
End If
Next

If the text value in variable B is not in CheckString, then this is the
first time you have seen its value, so assign it to the array and then store
its value, with a delimiter on both sides of it, into CheckString. I have
used Chr(1) as my delimiter because under normal circumstances it will not
appear in any of the text being assigned to B during the loop. You can use
any character (or characters) that you **know** for certain will never
appear in your text strings for the delimiter. The reason you need this
delimiter is to stop accidental substring finds crossing over between your B
values. For example, if two consecutive values being assigned to B during
the loop were "moth" and "error" and did not use a delimiter between them,
then they would go into the CheckString as "...motherror..." and the latter
assignment of "mother" to B would register as already having been added to
the array... the delimiters guarantee this won't happen.
 
I think the fastest method is to test for a value's existence in an
array is to use the Match function. For example, examine the
following code:

' <START CODE>
Dim Arr(1 To 5) As String
Dim Ndx As Long
Dim B As String
Dim V As Variant

''''''''''''''
' load up some test values
For Ndx = 1 To 3
Arr(Ndx) = Chr(Asc("a") + Ndx - 1)
Next Ndx

''''''''''''''
B = "f" ' doesn't exist in Arr
V = Application.Match(B, Arr, 0)
If IsError(V) Then
' does not exist
Arr(Ndx) = B
Else
' exists, do nothing
End If

''''''''''''''
B = "b" ' exists in Arr
V = Application.Match(B, Arr, 0)
If IsError(V) Then
' does not exist
Arr(Ndx) = B
Else
' exists, do nothing
End If

''''''''''''''
' list content
For Ndx = LBound(Arr) To UBound(Arr)
Debug.Print Ndx, Arr(Ndx)
Next Ndx
' <END CODE>

First, part of the array Arr is given some test values, "a", "b", and
"c". Then, B is assigned "f". The value "f" is searched for in Arr by
the Match function. The variant V holds the result of Match. If it is
an error (IsError = True), then "f" does not exist in the array and is
added to the array. It is assumed that at this point in the code, the
variable Ndx points to the first unused element of Arr. Next, the
value "b" is assigned to the variable B and again Match is used to see
if "b" exists in Arr. Since it does already exist, Match assigns its
position to V, and when V is tested for an error, IsError returns
False so we know "b" already exists.

Finally, the code just lists the content of Arr.

Cordially,
Chip Pearson
Microsoft Most Valuable Professional,
Excel, 1998 - 2010
Pearson Software Consulting, LLC
www.cpearson.com
 
Just to show that the Instr method looks indeed faster (some 3 times) than
doing a simple array loop.
Not looked at using the Match function.

Option Explicit
Private lStartTime As Long
Private Declare Function timeGetTime Lib "winmm.dll" () As Long

Sub ArrayTest()

Dim i As Long
Dim n As Long
Dim x As Long
Dim bDup As Boolean
Dim arrString(1 To 10000) As String
Dim strAdd As String

StartSW

For i = 1 To 10000

strAdd = RandomWord(2)
bDup = False

For n = 1 To x
If arrString(n) = strAdd Then
bDup = True
Exit For
End If
Next n

If bDup = False Then
x = x + 1
arrString(x) = strAdd
End If

Next i

StopSW

For i = 1 To x
Cells(i, 1) = arrString(i)
Next i

End Sub

Sub ArrayTest2()

Dim i As Long
Dim n As Long
Dim x As Long
Dim arrString(1 To 10000) As String
Dim strAdd As String
Dim strUnique As String

StartSW

strUnique = "|"

For i = 1 To 10000

strAdd = RandomWord(2)

If InStr(1, strUnique, "|" & strAdd & "|", vbBinaryCompare) = 0 Then
x = x + 1
arrString(x) = strAdd
strUnique = strUnique & strAdd & "|"
End If

Next i

StopSW

For i = 1 To x
Cells(i, 1) = arrString(i)
Next i

End Sub

Function RandomWord(lChars As Long) As String

Dim i As Long

RandomWord = String(lChars, Chr(32))

For i = 1 To lChars
Mid$(RandomWord, i, 1) = Chr(Int((57 * Rnd) + 65))
Next i

End Function

Sub StartSW()
lStartTime = timeGetTime()
End Sub

Function StopSW(Optional bMsgBox As Boolean = True, _
Optional vMessage As Variant, _
Optional lMinimumTimeToShow As Long = -1) As Variant

Dim lTime As Long

lTime = timeGetTime() - lStartTime

If lTime > lMinimumTimeToShow Then
If IsMissing(vMessage) Then
StopSW = lTime
Else
StopSW = lTime & " - " & vMessage
End If
End If

If bMsgBox Then
If lTime > lMinimumTimeToShow Then
MsgBox "Done in " & lTime & " msecs", , vMessage
End If
End If

End Function


RBS
 
Unless I am overlooking something, using Application.Match looks very slow
to me.
Using the helper code as posted previously.

Sub ArrayTest3()

Dim i As Long
Dim n As Long
Dim x As Long
Dim arrString(1 To 10000) As String
Dim strAdd As String
Dim V As Variant

StartSW

For i = 1 To 10000

strAdd = RandomWord(2)

V = Application.Match(strAdd, arrString, 0)

If IsError(V) Then
x = x + 1
arrString(x) = strAdd
End If

Next i

StopSW

For i = 1 To x
Cells(i, 1) = arrString(i)
Next i

End Sub


RBS
 
Using a collection is a lot faster, but has the drawback that the uniqueness
is
case-insensitive, so if you have for example AA then aa won't be added:

Sub ArrayTest4()

Dim i As Long
Dim n As Long
Dim x As Long
Dim collString As Collection
Dim strAdd As String
Dim V As Variant

StartSW

Set collString = New Collection

On Error Resume Next

For i = 1 To 10000
strAdd = RandomWord(2)
collString.Add strAdd, strAdd
Next i

StopSW

Cells.Clear

For i = 1 To collString.Count
Cells(i, 1) = collString.Item(i)
Next i

End Sub


Using cCollection in Olaf Schmidt's dhRichClient3:
http://www.thecommon.net/3.html
is faster still and has the advantage of have both case-sensitive and
case-insensitive uniqueness testing:

Sub ArrayTest5()

Dim i As Long
Dim n As Long
Dim x As Long
Dim collString As cCollection
Dim strAdd As String
Dim V As Variant

StartSW

Set collString = New cCollection

With collString
.CompatibleToVBCollection = False
.UniqueKeys = True
.StringCompareMode = BinaryCompare
End With

For i = 1 To 10000
strAdd = RandomWord(2)
If collString.Exists(strAdd) = False Then
collString.Add strAdd, strAdd
End If
Next i

StopSW

Cells.Clear

For i = 1 To collString.Count
Cells(i, 1) = collString.ItemByIndex(i - 1)
Next i

End Sub


I think this might be the best option, if you don't mind adding the
reference to dhRichClient3.


RBS
 
I forgot to mention that, as written, the test for uniqueness is case
sensitive; however, changing the InStr test to this will make the test case
insensitive...

If InStr(1, CheckString, Chr(1) & B & Chr(1), vbTextCompare) > 0 Then

You should only use this form of the test if you really need a case
insensitive test since, while still quite fast, it will be slower than using
the case sensitive test I posted initially. Also, as the number of items
dumped into the text String gets very, very large, the code will start to
slow down due to the repeated concatenations. There is a method to overcome
this which I'll post in a little while (I've got to re-develop it<g>).
 
Thank -- RB, Rick, Bernd, Dave, and Chip !!
Your responses have been very helpful.

Jay Dean
 
Here is a method that should be faster than what I have posted previously...
this is the code I mentioned in my other response (to myself)...

Dim X As Long, StartPosition As Long, ArrayIndex As Long
Dim B As String, CheckString As String, Arr() As String
'....
'....
ReDim Arr(1 To SomeMaxIndex)
CheckString = String(200000, Chr(1))
StartPosition = 2
ArrayIndex = LBound(Arr)
For X = 1 To SomeMaxIndex
'
' Some kind of conditioning code goes here I presume
'
If YourCondition Then
If InStr(CheckString, Chr(1) & B & Chr(1)) = 0 Then
Arr(ArrayIndex) = B
ArrayIndex = ArrayIndex + 1
Mid(CheckString, StartPosition) = B
StartPosition = StartPosition + Len(B) + 1
End If
End If
Next
ReDim Preserve Arr(1 To (ArrayIndex - 1))

There is one drawback to this method though, you have to estimate the
maximum number of characters that could be in the CheckString variable. To
do this, you need to be somewhat familiar with your data. Let's say the
longest text string you expect to have is 19 characters long and that you
expect to have about 10,000 unique text strings when you are done processing
your data. Add one to the maximum number of characters and then multiply in
order to get the upper limit (I called it SomeMaxIndex in my code above) to
Dim your Arr array to...

SomeMaxIndex = (19 + 1) * 10000

which is how I got my estimate of 200000 in my CheckString assignment
statement above. Also note that the StartPosition will always be 2 (we need
a Chr(1) in front of the CheckString text).
 
Unless I am overlooking something, using Application.Match looks very slow
to me.

I don't know. I ran it with an array of about 1000 elements and it was
essentially instantaneous.


Cordially,
Chip Pearson
Microsoft Most Valuable Professional,
Excel, 1998 - 2010
Pearson Software Consulting, LLC
www.cpearson.com
 
Did you run the posted code?

RBS


Chip Pearson said:
I don't know. I ran it with an array of about 1000 elements and it was
essentially instantaneous.


Cordially,
Chip Pearson
Microsoft Most Valuable Professional,
Excel, 1998 - 2010
Pearson Software Consulting, LLC
www.cpearson.com
 
Back
Top