Delete matching cells

  • Thread starter Thread starter gary
  • Start date Start date
G

gary

Col A has 360,000 cells.
Col B has 240,000 cells.

A B
0000000021957 0000000022002
0000000022002 0000000032002
0000000031957 0000000032003
0000000032002 0000000042002
0000000032003 0000000052002
0000000042002 0000000052003
0000000052002 0000000062002
0000000052003 0000000102002
0000000061967 0000000121996
0000000061968 0000000142002
0000000062002 0000000152002
0000000081963 0000000162002
0000000102002 0000000481994
0000000121996 0000000481995
0000000142002 0000000481996
0000000152002 0000000481997
0000000162002 0000000481998
0000000341991 0000000481999
0000000401961 0000000482000

How can I delete the cells in Col A whose contents match cells in Col
B?
 
One method.

Insert a column left of Column A
Insert a row at top
Add titles in A1:C1..........will need for filtering


In A2 enter =COUNTIF($C$2:$C$20,B2)<>0

D-click on fill handle to copy down to bottom of Column B

Select Columns A and B only

Data>Filter>Autofilter.

Filter for True on Column A

Select from A2 to bottom of Column B.

F5>Special>Visible Cells>OK

Edit>Clear>Contents

Remove Filter.

Select A and B then F5>Special>Blanks>OK

Edit>Delete>Shift Cells Up

Done


Gord
 
You posted in the Programming group so I guess VBA is in order.

Sub test()
For Each cell In Range("A1:A360000")
If WorksheetFunction.CountIf(Range("B1:B240000"), _
cell.Value) <> 0 Then
cell.ClearContents
End If
Next
End Sub

You can get rid of the blanks after running.


Gord
 
You posted in the Programming group so I guess VBA is in order.

Sub test()
For Each cell In Range("A1:A360000")
If WorksheetFunction.CountIf(Range("B1:B240000"), _
cell.Value) <> 0  Then
cell.ClearContents
End If
Next
End Sub

You can get rid of the blanks after running.

Gord











- Show quoted text -

Hi Gord,

How long should your macro run? (It's been running for more than 2
hours). Is there any way to determine its progress?
 
I have no idea how long it should take but 2 hours+ sounds a little
much.

I ran it on about 100 rows which is somewhat smaller than the range
you have. Took a second.

See Ron's macro..........proably much faster than a loop.

Did you try the manual method?


Gord
 
Here's another macro, using the AdvancedFilter.
Please do this on a copy of your data.

You will need to set ws to the proper worksheet.  I used Sheet2.

=============================
Option Explicit
Sub PruneColA()
    Dim ws As Worksheet
    Dim rColA As Range, rColB As Range
    Dim c As Range
    Dim rCrit As Range
    Dim i As Long
    Dim v As Variant
Set ws = Worksheets("Sheet2")
With ws
    Set rColA = Range(.Cells(1, "A"), .Cells(.Rows.Count, "A").End(xlUp))
    Set rColB = Range(.Cells(1, "B"), .Cells(.Rows.Count, "B").End(xlUp))
    Set rCrit = .UsedRange.Resize(rowsize:=2, columnsize:=1).Offset _
        (columnoffset:=.UsedRange.Columns.Count + 2)
End With

Application.ScreenUpdating = False

rCrit(1).ClearContents
rCrit(2) = "=countif(" & rColB.Address & "," & rColA(2).Address(False, False) & ")>0"

With rColA
    .AdvancedFilter Action:=xlFilterInPlace, criteriarange:=rCrit
End With

rCrit.EntireColumn.Delete

On Error Resume Next
rColA.Offset(rowoffset:=1).Resize(rowsize:=rColA.Rows.Count - 1) _
    .SpecialCells(xlCellTypeVisible).ClearContents
On Error GoTo 0

i = 0
ReDim v(1 To WorksheetFunction.CountA(rColA))
For Each c In rColA
    c.EntireRow.RowHeight = 15
    If c.Value <> "" Then
        i = i + 1
        v(i) = c.Text
    End If
Next c

rColA.ClearContents
Set rColA = rColA.Resize(rowsize:=UBound(v))
rColA = WorksheetFunction.Transpose(v)

Application.ScreenUpdating = True

End Sub
===================================- Hide quoted text -

- Show quoted text -

How long should your macro run?
Is there any way to determine its progress?
 
I have no idea how long it should take but 2 hours+ sounds a little
much.

I ran it on about 100 rows which is somewhat smaller than the range
you have.  Took a second.

See Ron's macro..........proably much faster than a loop.

Did you try the manual method?

Gord





- Show quoted text -

In your =Count formula, I changed $C$20 to $C$239820 (which is the
number of cells in my spreadsheet)

When copying that formula down to the bottom of Col B, it's already
taken 30 minutes and "Calculating" is at 4%..






..
 
Here's another macro, using the AdvancedFilter.
Please do this on a copy of your data.

You will need to set ws to the proper worksheet.  I used Sheet2.

=============================
Option Explicit
Sub PruneColA()
    Dim ws As Worksheet
    Dim rColA As Range, rColB As Range
    Dim c As Range
    Dim rCrit As Range
    Dim i As Long
    Dim v As Variant
Set ws = Worksheets("Sheet2")
With ws
    Set rColA = Range(.Cells(1, "A"), .Cells(.Rows.Count, "A").End(xlUp))
    Set rColB = Range(.Cells(1, "B"), .Cells(.Rows.Count, "B").End(xlUp))
    Set rCrit = .UsedRange.Resize(rowsize:=2, columnsize:=1).Offset _
        (columnoffset:=.UsedRange.Columns.Count + 2)
End With

Application.ScreenUpdating = False

rCrit(1).ClearContents
rCrit(2) = "=countif(" & rColB.Address & "," & rColA(2).Address(False, False) & ")>0"

With rColA
    .AdvancedFilter Action:=xlFilterInPlace, criteriarange:=rCrit
End With

rCrit.EntireColumn.Delete

On Error Resume Next
rColA.Offset(rowoffset:=1).Resize(rowsize:=rColA.Rows.Count - 1) _
    .SpecialCells(xlCellTypeVisible).ClearContents
On Error GoTo 0

i = 0
ReDim v(1 To WorksheetFunction.CountA(rColA))
For Each c In rColA
    c.EntireRow.RowHeight = 15
    If c.Value <> "" Then
        i = i + 1
        v(i) = c.Text
    End If
Next c

rColA.ClearContents
Set rColA = rColA.Resize(rowsize:=UBound(v))
rColA = WorksheetFunction.Transpose(v)

Application.ScreenUpdating = True

End Sub
===================================- Hide quoted text -

- Show quoted text -


Your macro (using the Advanced Filter) is getting Run-time Error
'1004' of "AdvancedFilter method of Range clsss failed".

Note: I've set ws = Worksheets("Sheet1")
 
One way...

Option Explicit

Sub StripDupes()
Dim i&, j&, lRows1&, lRows2&, lLastRow& 'as long
Dim vRng As Variant
lRows1 = Cells(Rows.Count, "A").End(xlUp).Row
lRows2 = Cells(Rows.Count, "B").End(xlUp).Row
If lRows1 > lRows2 Then lLastRow = lRows1 Else lLastRow = lRows2
vRng = Range("A1:B" & lLastRow)
For i = UBound(vRng) To 1 Step -1
If Not vRng(i, 1) = "" Then
For j = 1 To UBound(vRng)
If vRng(i, 1) = vRng(j, 2) Then
Cells(i, 1).Delete shift:=xlUp: Exit For
End If
Next 'j
End If
Next 'i
End Sub
 
You can speed it up by turning ScreenUpdating off...
One way...

Option Explicit

Sub StripDupes()
Dim i&, j&, lRows1&, lRows2&, lLastRow& 'as long
Dim vRng As Variant
lRows1 = Cells(Rows.Count, "A").End(xlUp).Row
lRows2 = Cells(Rows.Count, "B").End(xlUp).Row
If lRows1 > lRows2 Then lLastRow = lRows1 Else lLastRow = lRows2
vRng = Range("A1:B" & lLastRow)
Application.ScreenUpdating = False
For i = UBound(vRng) To 1 Step -1
If Not vRng(i, 1) = "" Then
For j = 1 To UBound(vRng)
If vRng(i, 1) = vRng(j, 2) Then
Cells(i, 1).Delete shift:=xlUp: Exit For
End If
Next 'j
End If
Next 'i
Application.ScreenUpdating = True
 
It is not possible to say how long the macro would take to run as that isdependent not only on the code, but also on the characteristics of your particular machine and environment.

I rewrote the macro in a manner which should significantly improve the speed.  The re-written macro, run against the data that you provided which I copied and repeated down to 600,000 rows, took about five minutes to run.

However, I only disabled screen updating as there is nothing else on thatworksheet of mine.  If you have calculations which refer to the cells that are being altered, event triggered macros, and other calls on the resources, that can slow things down.  All of those things can be disabled if they are an issue, but for now let's see if we can't get something running.

The way it has been rewritten, there is no way to determine its progress. The implementation of AdvancedFilter is an Excel feature, not VBA.  Although it should execute much more quickly than looping through the rows of cells, I don't know of any way to monitor its progress.

However, there are issues with speed, function, and the AdvancedFilter inExcel especially when dealing with large data bases.  I have developed amethod which I think should run MUCH more quickly, but it is hard to test on a small database.  It does assume that there are no duplicates in Column A, or if there are that you only want to display unique values.  Is that a valid assumption?

The method needs a little refinement but with your database duplicated down to about 500,000 rows, it runs in less than five seconds.  Of course, there are only seven entries in ColA that do not appear in Colb.  I have not idea how it would run with a different data set.  But try it and let me know.

Also, with this method, it would be possible to keep track of where it is..

Note the comment at the beginning about setting a reference.  It will NOT run if that reference isn't set.  If this will be distributed, we can use late binding, but not tonight.

===================================
Option Explicit
Sub PruneColA()
'Requires setting reference (tools/references) to
'  Microsoft Scripting Runtime

    Dim ws As Worksheet
    Dim rColA As Range, rColB As Range
    Dim vColA As Variant, vColB As Variant
    Dim dColA As Dictionary, dColB As Dictionary
    Dim i As Long
    Dim d As Variant

Set dColA = New Dictionary
Set dColB = New Dictionary
Set ws = Worksheets("Sheet2")
With ws
    Set rColA = Range(.Cells(1, "A"), .Cells(.Rows.Count, "A").End(xlUp))
    Set rColB = Range(.Cells(1, "B"), .Cells(.Rows.Count, "B").End(xlUp))
End With

vColB = rColB
vColA = rColA

For i = LBound(vColB, 1) + 1 To UBound(vColB, 1)
    With dColB
        If Not .Exists(Key:=vColB(i, 1)) Then .Add Key:=vColB(i, 1), Item:=vColB(i, 1)
    End With
Next i

For i = LBound(vColA, 1) + 1 To UBound(vColA, 1)
    If Not dColB.Exists(Key:=vColA(i, 1)) Then
        With dColA
            If Not .Exists(Key:=vColA(i, 1)) Then .Add Key:=vColA(i, 1), Item:=vColA(i, 1)
        End With
    End If
Next i

ReDim vColA(1 To dColA.Count)
i = 0
For Each d In dColA
    i = i + 1
    vColA(i) = dColA(d)
Next d

rColA.Offset(rowoffset:=1).ClearContents
Set rColA = rColA.Resize(rowsize:=dColA.Count).Offset(rowoffset:=1)
rColA = WorksheetFunction.Transpose(vColA)

End Sub
+++++++++++++++++++++++++++++++


Run-time Error '13'
Type Mismatch in

rColA = WorksheetFunction.Transpose(vColA)
 
It is not possible to say how long the macro would take to run as that isdependent not only on the code, but also on the characteristics of your particular machine and environment.

I rewrote the macro in a manner which should significantly improve the speed.  The re-written macro, run against the data that you provided which I copied and repeated down to 600,000 rows, took about five minutes to run.

However, I only disabled screen updating as there is nothing else on thatworksheet of mine.  If you have calculations which refer to the cells that are being altered, event triggered macros, and other calls on the resources, that can slow things down.  All of those things can be disabled if they are an issue, but for now let's see if we can't get something running.

The way it has been rewritten, there is no way to determine its progress. The implementation of AdvancedFilter is an Excel feature, not VBA.  Although it should execute much more quickly than looping through the rows of cells, I don't know of any way to monitor its progress.

However, there are issues with speed, function, and the AdvancedFilter inExcel especially when dealing with large data bases.  I have developed amethod which I think should run MUCH more quickly, but it is hard to test on a small database.  It does assume that there are no duplicates in Column A, or if there are that you only want to display unique values.  Is that a valid assumption?

The method needs a little refinement but with your database duplicated down to about 500,000 rows, it runs in less than five seconds.  Of course, there are only seven entries in ColA that do not appear in Colb.  I have not idea how it would run with a different data set.  But try it and let me know.

Also, with this method, it would be possible to keep track of where it is..

Note the comment at the beginning about setting a reference.  It will NOT run if that reference isn't set.  If this will be distributed, we can use late binding, but not tonight.

===================================
Option Explicit
Sub PruneColA()
'Requires setting reference (tools/references) to
'  Microsoft Scripting Runtime

    Dim ws As Worksheet
    Dim rColA As Range, rColB As Range
    Dim vColA As Variant, vColB As Variant
    Dim dColA As Dictionary, dColB As Dictionary
    Dim i As Long
    Dim d As Variant

Set dColA = New Dictionary
Set dColB = New Dictionary
Set ws = Worksheets("Sheet2")
With ws
    Set rColA = Range(.Cells(1, "A"), .Cells(.Rows.Count, "A").End(xlUp))
    Set rColB = Range(.Cells(1, "B"), .Cells(.Rows.Count, "B").End(xlUp))
End With

vColB = rColB
vColA = rColA

For i = LBound(vColB, 1) + 1 To UBound(vColB, 1)
    With dColB
        If Not .Exists(Key:=vColB(i, 1)) Then .Add Key:=vColB(i, 1), Item:=vColB(i, 1)
    End With
Next i

For i = LBound(vColA, 1) + 1 To UBound(vColA, 1)
    If Not dColB.Exists(Key:=vColA(i, 1)) Then
        With dColA
            If Not .Exists(Key:=vColA(i, 1)) Then .Add Key:=vColA(i, 1), Item:=vColA(i, 1)
        End With
    End If
Next i

ReDim vColA(1 To dColA.Count)
i = 0
For Each d In dColA
    i = i + 1
    vColA(i) = dColA(d)
Next d

rColA.Offset(rowoffset:=1).ClearContents
Set rColA = rColA.Resize(rowsize:=dColA.Count).Offset(rowoffset:=1)
rColA = WorksheetFunction.Transpose(vColA)

End Sub
+++++++++++++++++++++++++++++++

I have deleted the duplicates in col A and in Col B.

When I re-ran your macro, I still got the Run-time Error '13'
Type Mismatch in

rColA = WorksheetFunction.Transpose(vColA)
 
Empirically testing this theory in my Excel 2007 reveals that worksheetfunction.transpose works OK with 2^16 elements, but returns the type mismatcherror with 2^17 elements, so that is probably why you ran into that error. Avoiding transpose, as I did, should fix it (unless we run into a different limit).

Hi Ron,

I ran without transpose and it finished in about 1 minute!

In Col A, a cell contains 0507811951990
That value is not in Col B.
Your macro is displaying it as 5.07811E+11
How can I get it to be displayed as 0507811951990 ?
 
Hi Ron,

My spreadsheet has:

A B
0000000021957 0000000022002
0000000022002 0000000032002
0000000031957 0000000032003
0000000032002 0000000042002
0000000032003 0000000052002
0000000042002 0000000052003
0000000052002 0000000062002
0000000052003 0000000102002
0000000061967 0000000121996
0000000061968 0000000142002
0000000062002 0000000152002
0000000081963 0000000162002

Your macro (without Transpose) returns this:

0000000021957
22002
31957
61967
61968
81963
341991
401961
431978
482010
482011

In my spreadsheet:
A2 contains 0000000022002
B1 contains 0000000022002

But your macro results contains 22002
Why are the leading zeroes being dropped?
 
You can further optimize performance as follows...

Option Explicit

Sub StripDupes()
Dim i&, j&, lRows1&, lRows2&, lLastRow&, lCalcMode& 'as long
Dim vRng As Variant, bEventsEnabled As Boolean

lRows1 = Cells(Rows.Count, "A").End(xlUp).Row
lRows2 = Cells(Rows.Count, "B").End(xlUp).Row
If lRows1 > lRows2 Then lLastRow = lRows1 Else lLastRow = lRows2
vRng = Range("A1:B" & lLastRow)

With Application
lCalcMode = .Calculation: .Calculation = xlCalculationManual
bEventsEnabled = .EnableEvents: .EnableEvents = False
.ScreenUpdating = False
End With

For i = UBound(vRng) To 1 Step -1
If Not vRng(i, 1) = "" Then
For j = 1 To UBound(vRng)
If vRng(i, 1) = vRng(j, 2) Then
Cells(i, 1).Delete shift:=xlUp: Exit For
End If
Next 'j
End If
Next 'i

'Cleanup
With Application
.Calculation = lCalcMode
.EnableEvents = bEventsEnabled
.ScreenUpdating = True
End With
End Sub
 
Well, in Excel 2007 I just made up a test case of 500,000 entries in col a
and col b. There were about 238,000 in col a that were not in col b and the
last version worked fine and executed in about 1 minute.

Ron,
I'd appreciate feedback on using my (3rd) posted code on your test
data! I extracted the concept of using the array approach from an app I
have for filtering out rows of data from a data logging output file.
This requires at least xl12 to work due to the amount of data being
just under 1GB. I believe the limit on array size is 2GB but since my
app uses its own instance of Excel there's nothing else running in its
memory space.
 
I'm glad to hear that the macro is working and not taking hours :-)
The leading zero's are being dropped because Excel is trying to be helpful and interpreting the data as numeric.  We have two choices to change this and retain the speed:
  We can format the column as text.
  We can custom format the column to "0000000000000"  (thirteen zero's)
The latter retains the numeric characteristics; the former does not, but some Excel functions will still interpret this as a number.  The choice is yours.

Here's how to modify the code to provide for that.  Note the lines nearthe bottom.

===================================
Option Explicit
Sub PruneColA()
'Requires setting reference (tools/references) to
'  Microsoft Scripting Runtime

    Dim ws As Worksheet
    Dim rColA As Range, rColB As Range
    Dim vColA As Variant, vColB As Variant
    Dim dColA As Dictionary, dColB As Dictionary
    Dim i As Long
    Dim d As Variant

Set dColA = New Dictionary
Set dColB = New Dictionary
Set ws = ActiveSheet
With ws
    Set rColA = Range(.Cells(1, "A"), .Cells(.Rows.Count, "A").End(xlUp))
    Set rColB = Range(.Cells(1, "B"), .Cells(.Rows.Count, "B").End(xlUp))
End With

vColB = rColB
vColA = rColA

For i = LBound(vColB, 1) + 1 To UBound(vColB, 1)
    With dColB
        If Not .Exists(Key:=vColB(i, 1)) Then .Add Key:=vColB(i, 1), Item:=vColB(i, 1)
    End With
Next i

For i = LBound(vColA, 1) + 1 To UBound(vColA, 1)
    If Not dColB.Exists(Key:=vColA(i, 1)) Then
        With dColA
            If Not .Exists(Key:=vColA(i, 1)) Then .Add Key:=vColA(i, 1), Item:=vColA(i, 1)
        End With
    End If
Next i

ReDim vColA(1 To dColA.Count, 1 To 1)
i = 0
For Each d In dColA
    i = i + 1
    vColA(i, 1) = dColA(d)
Next d

rColA.Offset(rowoffset:=1).ClearContents
Set rColA = rColA.Resize(rowsize:=dColA.Count).Offset(rowoffset:=1)

'UNcomment one or the other of the next two lines depending on your preference
'rColA.EntireColumn.NumberFormat = "0000000000000"
rColA.EntireColumn.NumberFormat = "@"

rColA = vColA

End Sub
=======================================- Hide quoted text -

- Show quoted text -

I'm using:

rColA.EntireColumn.NumberFormat = "0000000000000"
'rColA.EntireColumn.NumberFormat = "@"

Because the result still contains 0000000022002 (which is in Col B)
and this makes the result suspect.
 
Gary,

When I interrupted it it had been running for 219 seconds.  At that point in time it had eliminated 617 entries from the column A list.

I then started up my "dictionary" routine.  It ran for about 58.6 seconds and eliminated the remaining 260,493 duplicated entries.

To set up the sample data, I enter a formula like:

A1 & B1:  =text(randbetween(1,10^6),"0000000000000")
   Fill down to row 500,000.
  Then copy/Paste Values

For timing I use the HiRes timer.

I initially tried an approach like yours:
   Examine each cell
   If the data is invalid, delete the cell and rearrange the rest (delete xlshiftup)

After some thought, I decided it should be faster to
   Get all the good data into a sequential array.
   Delete ALL the original data
   Write back the good data array.

The approach I used, using the dictionary, works pretty fast.  It's disadvantage is that if duplicates in the original data should be retained, itwould have to be modified.  (i.e. if there are multiple 0000000123456's in column A, and none of that value in Column B, and the multiple values all need to be retained in column A; and they need to be retained in their original order).  Fortunately, that is not the case.

And if I had Excel 2010, the Advanced Filter might work.  I would filter/copy; then delete the original and write back the copy.  That would work even with duplicates.  But it won't work in Excel 2007 with this data base (and seems to run  slower even with smaller databases).

==========================

I'm using:

rColA.EntireColumn.NumberFormat = "0000000000000"
'rColA.EntireColumn.NumberFormat = "@"

But the results have:

0000000021957 in A1 (but that value is NOT in Col B)
0000000022002 in A2 (but that value IS in Col B).

and this makes the results suspect.
 
Ron,
For what it is worth...
Since, a variant filled with an array of Range.Values is one based and a Dictionary object is one
based,
the two code lines below should probably omit the "+ 1".

For i = LBound(vColB, 1) + 1 To UBound(vColB, 1)
For i = LBound(vColA, 1) + 1 To UBound(vColA, 1)
'---
Jim Cone
Portland, Oregon USA
http://www.mediafire.com/PrimitiveSoftware
(XL Companion add-in: compares, matches, counts, lists, finds, deletes...)






"Ron Rosenfeld" <[email protected]>
wrote in message
news:o[email protected]...
 
Back
Top