how to find union of two arrays

  • Thread starter: Andrew

Andrew

Hello,
Does anyone know a means of comparing two very large arrays (>50000
elements) to determine which elements are common to both arrays? I've
tried using looping functions and that is far too slow.
 
hi,

I don't know why your looping is so slow; it shouldn't be. Can you show your code?
 
hi,

I don't know why your looping is so slow; it shouldn't be. Can you show your code?

Isabel,
Imagine one range (A) with 50,000 rows and 2 columns, and a second
range (B) with 50,000 rows and 1 column. I want to find all the
elements in B which have matches in column 1 of A. For each match, I
want to copy the corresponding entry in A column 2 over to a new
column 2 in B.

Example.
A=
1 a
2 f
3 u
4 w
5 q

B=
5
1
6
8
9


After running the code, B would be transformed to:
B=
5 q
1 a
6
8
9

To do this I create the following loops.

For k1 = 1 To 50000
    dataX = Worksheets(1).Cells(k1, 1)
    For k2 = 1 To 50000
        dataY = Worksheets(2).Cells(k2, 1)
        If dataX = dataY Then
            Worksheets(1).Cells(k1, 2) = Worksheets(2).Cells(k2, 2)
        End If
    Next
Next

This code is accurate, but terribly slow.

This loop has to execute 50000^2 loops, 2.5 billion loops. Let's say
that each iteration requires 20 CPU clock cycles, that's 50 billion
clock cycles. My CPU has a 2.5 GHz processor, so this ideally would
take 20 seconds. It takes more than that, about 45 seconds. If I
were doing this in Matlab or C, I could process this much data in less
than a second, but not because the CPU is faster. There are other
programming constructs available which are faster. I know Excel can
run faster than this, but I don't know any tricks on how to speed this
up.

Any ideas?
 
hi Andrew,

with the functions "Index" and "Match" you can reduce the number of loops


Code:
Sub Macro1()
    Dim Wks1 As Worksheet, Wks2 As Worksheet, k2 As Integer
    Set Wks1 = Worksheets(1)
    Set Wks2 = Worksheets(2)
    For k2 = 1 To 5
        If Not IsError(Application.Match(Wks2.Cells(k2, 1), Wks1.Range("A:A"), 0)) Then
            Wks2.Cells(k2, 2) = Application.Index(Wks1.Range("B:B"), Application.Match(Wks2.Cells(k2, 1), Wks1.Range("A:A"), 0))
        End If
    Next
    Set Wks1 = Nothing
    Set Wks2 = Nothing
End Sub
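
As a side note, the Match result could be stored in a variable so the lookup
is only evaluated once per row; a minimal sketch of that variation (same
worksheets as above, still looping only the 5 example rows -- for the real
data the loop limit would be 50000):

Code:
Sub Macro1_SingleMatch()
    Dim Wks1 As Worksheet, Wks2 As Worksheet
    Dim k2 As Long, m As Variant
    Set Wks1 = Worksheets(1)
    Set Wks2 = Worksheets(2)
    For k2 = 1 To 5
        ' Application.Match returns an error value (not a run-time error) when nothing matches
        m = Application.Match(Wks2.Cells(k2, 1), Wks1.Range("A:A"), 0)
        If Not IsError(m) Then
            Wks2.Cells(k2, 2) = Application.Index(Wks1.Range("B:B"), m)
        End If
    Next
End Sub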
 
Imagine one range (A) with 50,000 rows and 2 columns, and a second
range (B) with 50,000 rows and 1 column. I want to find all the
elements in B which have matches in column 1 of A. For each match, I
want to copy the corresponding entry in A column 2 over to a new
column 2 in B. [....] I know Excel can run faster than this, but I
don't know any tricks on how to speed this up.

Why use code at all? You can do this with straight Excel formulas. Assuming
your range "A" is located on Sheet1, Columns A and B, starting in Row 1; and
assuming your range "B" is located in Column A starting in Row 1 on any
other sheet, put this formula in B1 on that other sheet and copy it down...

=IF(COUNTIF(Sheet1!A$1:A$500,A1),MATCH(A1,Sheet1!A$1:A$500,0),"")

Rick Rothstein (MVP - Excel)
 
For k1 = 1 to 50000
dataX=worksheets(1).cells(k1,1)
For k2=1 to 50000
dataY= worksheets(2).cells(k2,1)
If dataX=dataY then
worksheets(1).cells(k1,2)=worksheets(2).cells(k2,2)
end if
next
next [....]
This loop has to execute 50000^2 loops, 2.5 billion loops. [....]
I don't know any tricks on how to speed this up.

Well, that depends on details of the data and algorithm that you have
not mentioned.

(Note: I will use Sheet1 for Worksheets(1) and Sheet2 for
Worksheets(2), although those might not be the actual worksheet
names.)

First, your algorithm continues to look for a match even after finding
one. Effectively, it finds the __last__ match. Is that a
requirement? Or is it sufficient to find the __first__ match?

If you want to find the last match, your inner for-loop can be
improved by doing For k2=50000 to 1 step -1.

In either case, the inner loop can stop after finding a match. So you
can put the following statement before End If: Exit For.

Thus, on average a match is found after 25000 comparisons instead of
50000, assuming unique data in both Sheet1!A1:A50000 and Sheet2!
A1:A50000.

That cuts the total iterations in half, down to 1.25 billion.

Second, in your example, Sheet1!A1:A50000 is in ascending order. If
that is always the case, you could use a binary search instead of a
linear search. Thus, a match is found in at most 16 comparisons
(since 2^16 = 65,536 > 50,000).

That cuts the total iterations down to 800,000 at most, with some
probability that it is much less.

Finally, it is much more efficient to copy Sheet1!A1:B50000 and Sheet2!
A1:A50000 into a Variant variable one time, and to retain the results
in a Variant variable and copy them to Sheet2!B1:B50000 one time.

The time savings can be huge. Every time you access an Excel range,
VBA and Excel processes much exchange information -- two interprocess
communications -- and the Excel process must be scheduled to run while
the VBA process waits.

Moreover, even if Sheet1!A1:A50000 is not in ascending order, it is
probably advantageous to take the time to sort dataX below, if we
implemented a binary search (not included below).

So at a minimum (without the binary search), your VBA algorithm should
be:

Dim dataX, dataY
dataX = Worksheets(1).Range("A1:B50000")
dataY = Worksheets(2).Range("A1:A50000")
redim resultY(1 to 50000, 1)
for k1 = 1 to 50000: for k2 = 1 to 50000
    if dataY(k1,1) = dataX(k2,1) then
        resultY(k1,2) = dataX(k2,2)
        exit for
    end if
next: next
Worksheets(2).Range("B1:B50000") = resultY

(Note that I reversed the order of the for-loops. It would make a
difference if you implement a binary search of Sheet1!A1:A50000.)

Other improvements are possible and probably desirable; for example,
avoiding the hardcoded ranges. That is why I use Redim instead of Dim
for resultY.

By reducing the interprocess communications (copying of data between
VBA and Excel), I would be very surprised if that VBA implementation
fails to outperform any of the alternatives mentioned so far: VBA
Match/Index, .Find, and Excel COUNTIF/INDEX/MATCH.

(Although the Excel alternative might be more efficient if IFERROR can
be used instead of COUNTIF.)

I might post relative performance data later, if I get around to it.
 
Errata....

redim resultY(1 to 50000, 1)

Should be:

redim resultY(1 to 50000, 1 to 1)

resultY(k1,2) = dataX(k2,2)

Should be:

resultY(k1,1) = dataX(k2,2)

"Every time you access an Excel range, VBA and Excel processes much
exchange information -- two interprocess communications"

Should be: "must exchange".
 
On May 22, 7:43 am, "Rick Rothstein" wrote:
put this formula in B1 on that other sheet
and copy it down...
=IF(COUNTIF(Sheet1!A$1:A$500,A1),
MATCH(A1,Sheet1!A$1:A$500,0),"")

Presumably you mean:

=IF(COUNTIF(Sheet1!A$1:A$50000,A1),
INDEX(Sheet1!B$1:B$50000,MATCH(A1,Sheet1!A$1:A$50000,0)),"")

Or

=IF(COUNTIF(Sheet1!A$1:A$50000,A1),
VLOOKUP(A1,Sheet1!A$1:B$50000,2,0),"")

Since COUNTIF will always do 50000 comparisons, on average this does
75000 comparisons for each of 50000 entries in Sheet2!A1:A50000, a
total of 3.75 billion comparisons versus 1.25 billion for Andrew's
algorithm (corrected).

Nonetheless, this __might__ be an improvement because it avoids the
interprocess communication and process scheduling time between VBA and
Excel. (But I would be surprised.)
 
Sub Macro1()
Dim Wks1 As Worksheet, Wks2 As Worksheet, k2 As Integer
Set Wks1 = Worksheets(1)
Set Wks2 = Worksheets(2) [....]
Set Wks1 = Nothing
Set Wks2 = Nothing
End Sub

Why set Wks1 and Wks2 to Nothing before exiting the macro?

Since they are non-static local variables, I would expect VBA to
effectively do the same thing automagically.

Is there a memory leak in some revisions of VBA?
 
joeu2004 was thinking very hard :
Why set Wks1 and Wks2 to Nothing before exiting the macro? [....]
Is there a memory leak in some revisions of VBA?

My understanding of how this works is that it's just good programming
practice to do this because even though the objects are destroyed when
the procedure ends, the memory space that they occupied in the Stack
still persists. I may be wrong, but this is how it was explained to me
by the leading minds in Excel application development.
 
Moreover, even if Sheet1!A1:A50000 is not in ascending
order, it is probably advantageous to take the time to
sort dataX below, if we implemented a binary search

No! A selection sort requires nearly 1.25 billion (10^9)
comparisons, overwhelming any benefit of the binary search.

Excel's sort can do that in the blink of an eye. But a VBA
implementation takes "forever". Presumably VBA is significantly
slower because it is interpreted, not compiled.
By reducing the interprocess communications (copying
of data between VBA and Excel), I would be very surprised
if that VBA implementation fails to outperform any of the
alternatives mentioned so far: VBA Match/Index, .Find,
and Excel COUNTIF/INDEX/MATCH.

Color me "surprised"! When I made that comment, I forgot to account
for the fact that VBA is interpreted, not compiled.

Nonetheless, on my computer, a VBA binary search (50,000 times through
50,000 elements) is nearly 2.4 times faster than Isabelle's VBA Match
algorithm (significantly improved and modified to use a binary
search), and nearly 800 to 1200 times faster than Ron's VBA Find
algorithm and Rick's Excel COUNTIF/MATCH formula (improved to use
VLOOKUP and a binary search).

However, Rick's Excel formula can be nearly 3.7 times faster than the
VBA implementation if we use a helper column for the VLOOKUP (with a
binary search) and IF(ISERROR) instead of COUNTIF. In XL2007 and
later, I suspect we would get comparable results with IFERROR(VLOOKUP)
without the need for a helper column.
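
For what it's worth, the XL2007-and-later variant referred to here would
presumably be something like the following (same layout as Rick's formula;
this is the exact-match, linear-search form -- the binary-search timings
above assume sorted data and approximate-match lookups):

=IFERROR(VLOOKUP(A1,Sheet1!A$1:B$50000,2,0),"")

The pre-2007 helper-column idea is the same thing split in two: put the
VLOOKUP by itself in a spare column, then have column B test that helper
cell with IF(ISERROR(...),"",...), so the slow lookup is evaluated only
once per row.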

The improved VBA linear search is 1.8 to 2.2 times faster than Ron's
VBA Find algorithm and Rick's Excel COUNTIF/MATCH formula (improved to
use VLOOKUP, but a linear search).

However, Isabelle's VBA Match algorithm (significantly improved, but a
linear search) is nearly 5.8 times faster than the improved VBA linear
search. Even Isabelle's obviously inefficient implementation (doing
the Match twice) is nearly 2.9 times faster.

And again, Rick's Excel formula can be nearly 6 times faster than the
improved VBA linear search if we use a helper column for the VLOOKUP
(with a linear search).

YMMV significantly for a variety of reasons. Since my improved VBA
linear search algorithm runs nearly 7 times longer than what Andrew
claims ("45 seconds"), I suspect that my system is slow because of
small memory, small cache or both relative to more modern systems.

Also, my CPU is single-core and single-threaded (instruction
"threads", not process threads). At first, I did not think that would
be a factor, since VBA does not take advantage of multiple cores and
instruction threads. However, Excel can, perhaps favoring Rick's
approach. And I suspect it might improve Isabelle's and Ron's
approaches for similar reasons; but that is based on speculation.
 
Andrew said:
This loop has to execute 50000^2 loops, 2.5 billion loops.
Let's say that each iteration requires 20 CPU clock
cycles, that's 50 billion clock cycles.

You've been given many suggestions, particularly about taking advantage of
Excel's built-in functions. However, whichever approach you take, with this
type of thing it's really worthwhile to start with both arrays sorted;
typically that's a one-off task, or not much extra work when adding small
amounts of data.

Once both are sorted, search for the first matching item (and any further
duplicates). Then search for the second item starting at index+1 of the last
item found. Obviously, in any loop, once a matching item has been found, exit
that loop as soon as the next item doesn't match (i.e. after any duplicates)
and start the next. You should end up with a fraction of the amount of work
you are doing now.

If you regularly need to find individual items in a large sorted array, look
up "binary chop", which will find matches in negligible time even in large
arrays.
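
A minimal VBA sketch of such a binary chop, assuming the lookup table has
already been read into a 1-based, two-column Variant array sorted ascending
on column 1 (names here are purely illustrative):

' Returns the row index of key in column 1 of arr (sorted ascending),
' or 0 if key is not present. arr is a 1-based 2-D array such as the
' result of reading a range into a Variant.
Function BinaryFind(arr As Variant, key As Variant) As Long
    Dim lo As Long, hi As Long, m As Long
    lo = LBound(arr, 1)
    hi = UBound(arr, 1)
    Do While lo <= hi
        m = (lo + hi) \ 2
        If arr(m, 1) = key Then
            BinaryFind = m
            Exit Function
        ElseIf arr(m, 1) < key Then
            lo = m + 1
        Else
            hi = m - 1
        End If
    Loop
    BinaryFind = 0      ' not found
End Function

With both ranges loaded into Variant arrays first (as joeu2004 shows above),
the inner linear scan then reduces to a call such as
k2 = BinaryFind(dataX, dataY(k1, 1)) followed by a test for k2 > 0.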

Peter Thornton
 
put this formula in B1 on that other sheet
Presumably you mean:

=IF(COUNTIF(Sheet1!A$1:A$50000,A1),
INDEX(Sheet1!B$1:B$50000,MATCH(A1,Sheet1!A$1:A$50000,0)),"")

Or

=IF(COUNTIF(Sheet1!A$1:A$50000,A1),
VLOOKUP(A1,Sheet1!A$1:B$50000,2,0),"")

I can't believe I forgot to include it... I was going for the INDEX formula
version. Thanks for catching that.

Rick Rothstein (MVP - Excel)
 
I really like reading your messages... the amount of analysis you are
willing to perform, and then write up for us to read, is simply
awe-inspiring. Thank you for always taking the time to do this.

Rick Rothstein (MVP - Excel)
 
[joeu2004 wrote:] Is there a memory leak in some revisions of VBA?

[In reply:] yes there are; it's valid for all versions of VBA

I do not notice any such memory leak in XL2003 SP3 with VBA 6.5.1024.

I have only Task Manager Mem Usage and VM Size statistics to look at.
If there is a kernel function that returns the current process's
memory usage, please let me know. I was unable to find one.
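
(One function that might serve, for anyone curious, is GetProcessMemoryInfo
from psapi.dll; a rough 32-bit declaration sketch, untested here, with the
sizes as Longs so anything over 2 GB would wrap:)

Private Type PROCESS_MEMORY_COUNTERS
    cb As Long
    PageFaultCount As Long
    PeakWorkingSetSize As Long
    WorkingSetSize As Long
    QuotaPeakPagedPoolUsage As Long
    QuotaPagedPoolUsage As Long
    QuotaPeakNonPagedPoolUsage As Long
    QuotaNonPagedPoolUsage As Long
    PagefileUsage As Long
    PeakPagefileUsage As Long
End Type

Private Declare Function GetCurrentProcess Lib "kernel32" () As Long
Private Declare Function GetProcessMemoryInfo Lib "psapi.dll" ( _
    ByVal hProcess As Long, Counters As PROCESS_MEMORY_COUNTERS, _
    ByVal cb As Long) As Long

' Current working set of the Excel process, in bytes
Function CurrentWorkingSetBytes() As Long
    Dim pmc As PROCESS_MEMORY_COUNTERS
    pmc.cb = Len(pmc)
    If GetProcessMemoryInfo(GetCurrentProcess(), pmc, pmc.cb) <> 0 Then
        CurrentWorkingSetBytes = pmc.WorkingSetSize
    End If
End Function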

I tried the module below. After the MsgBox appears, allow 1-2 seconds
for Task Manager to update.

With #Const globDef set to False (local ws array), both Mem Usage and
VM Size increase by about 4 MB. After exiting the procedure, Mem
Usage and VM Size revert to their previous values.

Also note that setting the ws array to Nothing does __not__ reduce
memory usage. Moreover, it is not even necessary to set the ws to
Worksheets(1). The local declaration alone is sufficient to increase
Mem Usage and VM Size.

This is consistent with my understanding that the Dim statement alone
causes the allocation of the memory for the ws array.

(Aside.... I also tried changing Dim to ReDim, with the same
results.)

With #Const globDef set to True (non-local ws array), both Mem Usage
and VM Size increase by about 4 MB the first time the procedure is
executed.

In this case, after exiting the procedure, Mem Usage and VM Size
remain at that level, about 4 MB over their previous values. That is
to be expected, since global storage __should__ persist between
procedure calls.

And again note that setting the ws array to Nothing does __not__
reduce memory usage. And again note that it is not even necessary to
set ws to Worksheets(1).

Mem Usage and VM Size do revert to their previous values when I
remove the global ws array declaration, for example by setting #Const
globDef to False. The memory reduction happens almost immediately,
after waiting 1-2 seconds for Task Manager to update.

(Aside.... I would expect a local Static declaration to behave the
same way as a non-local declaration. But I did not test it. "The
exercise is left to the student" ;->.)

-----

#Const globDef = False

Private Declare Function GetCurrentProcessId _
    Lib "kernel32" () As Long

#If globDef Then
Dim ws(1 To 1000000) As Worksheet
#End If

Sub testit()
    #If Not globDef Then
    Dim ws(1 To 1000000) As Worksheet
    #End If
    Dim i As Long
    For i = 1 To UBound(ws)
        Set ws(i) = Worksheets(1)
    Next
    MsgBox GetCurrentProcessId() & " set to Worksheets"
    For i = 1 To UBound(ws)
        Set ws(i) = Nothing
    Next
    MsgBox GetCurrentProcessId() & " set to Nothing"
End Sub
 
<FWIW>
Here's a snippet from Professional Excel Development Ch.07 that
explains how memory leaks occur, and why it's considered *good
practice* to set objects we're done with to *= Nothing*.


<Quote>
"...Normally when you overwrite an object in VBA, VBA cleans up the old
version of the object and reclaims the memory that was used to hold it.
You can also set an object equal to Nothing to reclaim the memory used
by it. It is good practice to do this explicitly when you no longer
need an object, rather than relying on VBA to do it.

Set gclsCells = Nothing

When you create two objects that store references to each other the
system will no longer reclaim the memory they used when they are set to
new versions or when they are set to Nothing. When analyzing the
worksheet in Analysis5.xls with 574 cells in the used range, there is a
loss of about 250KB of RAM each time CreateCellsCollection is executed
during an Excel session.

NOTE

If you are running Windows NT, 2000 or XP you can check the amount of
RAM currently used by Excel by pressing Ctrl+Shift+Esc to display the
Processes window in Task Manager and examining the Mem Usage column for
the row where the Image Name column is EXCEL.EXE.

One way to avoid this problem is to make sure you remove the cross
references from the linked objects before the objects are removed. You
can do this by adding a method like the Terminate method shown in
Listing 7-15 to the problem classes, in our case the CCell class.

Listing 7-15
The Terminate Method in the CCell Class Module

Public Sub Terminate()
    Set mclsParent = Nothing
End Sub

The code in Listing 7-16 is added to the CCells class module. It calls
the Terminate method of each Cell class contained in the collection to
destroy the cross reference between the classes.

Listing 7-16
The Terminate Method in the CCells Class Module

Public Sub Terminate()
    Dim clsCell As CCell
    For Each clsCell In mcolCells
        clsCell.Terminate
        Set clsCell = Nothing
    Next clsCell
    Set mcolCells = Nothing
End Sub..."
</Quote>
 
Peter T pretended :
You've been given many suggestions [....] with this type of thing it's
really worthwhile to start with both arrays sorted [....] If you
regularly need to find individual items in a large sorted array, look
up "binary chop", which will find matches in negligible time even in
large arrays.

Peter,
I think the time it takes lies in the fact that *Arrays* aren't being
compared, but rather the loops read the ranges directly.

If the 2 ranges were dumped into arrays where at least 1 array was
sorted, a binary search would cut the processing time considerably.
Better still (as you say) if both arrays are sorted. The bottleneck
is looping the ranges directly; doing this in memory would run orders
of magnitude faster. <IMO>
 
[Garry wrote:]
I think the time it takes lies in the fact that *Arrays* aren't being
compared, but rather the loops read the ranges directly. [....]

There's some very good information in this thread. Thank you. I have
yet to try these find and match methods, although I did use VLOOKUP
for a search and reduced the search time (for the same 50,000 rows of
data) down to 3 seconds. For the VLOOKUP I used named ranges for the
first two input arguments, the search parameter and the lookup table.
Then I tried to do the same thing using arrays instead of ranges, and
it took at least one hundred times longer. I would think that it
would be faster to sort through an array than a range, but clearly
this is not true.

thanks again,
Andrew
 
After serious thinking Andrew wrote :
I would think that it would
be faster to sort through an array than a range, but clearly this is
not true.

Well, that depends on how you do it. Clearly there are fast ways and
slow ways, and so if it took as long as you say it did then I suspect
you used a slow way. For sure, if at least one of your arrays wasn't
sorted, it would take a significant amount of time to process. Far less
time if 1 array is sorted. Orders of magnitude less time if both arrays
are sorted.

Regardless, built-in Excel functions will almost always perform much
faster than VBA will and so if Rick's suggestion works for you I'd be
inclined to use it over VBA.
 