Compare two files?

Klaus Jensen

Hi!

I need to compare two text files and find out IF they are different (how they
are different does not matter). I need to do it as quickly as possible,
since thousands of files are going to be compared.

I basically need to do what the DOS program "fc" (file compare) does.

What is the best way to do that?

Thanks in advance

Klaus
 
First check the size of the files. If they're the same, open streams to
both files. Keep two buffers, one for each stream. When a buffer is
exhausted, read another chunk from the appropriate file. Compare each
byte within the buffers.
 
Great input, Jon. :)

I wrote/found some of this function somewhere. It is quite slow though. :( I
was hoping for some sort of hash-based comparison... Not sure how it would
be done. Comparing the files byte by byte is not very likely to get much
faster. :(

Function CompareFiles(ByVal file1 As String, ByVal file2 As String) As Boolean
    Dim byte1 As New Byte
    Dim byte2 As New Byte
    Dim stream1 As New System.IO.BinaryReader(IO.File.OpenRead(file1))
    Dim stream2 As New System.IO.BinaryReader(IO.File.OpenRead(file2))
    Do While stream1.BaseStream.Position < stream1.BaseStream.Length
        byte1 = stream1.ReadByte
        byte2 = stream2.ReadByte
        If byte1 <> byte2 Then
            Return False
            Exit Do
        End If
    Loop
    Return True
End Function
 
Klaus Jensen said:
Great input, Jon. :)

I wrote/found some of this function somewhere. It is quite slow though. :( I
was hoping for some sort of hash-based comparison...

A hash-based comparison is going to be no faster, as the hash would
need to take every byte into account to start with.
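
To make that point concrete, here is a rough sketch of what a hash-based check might look like. It is an illustration only, not code from this thread: the HashCompare class and its method names are made up, and MD5 is an arbitrary choice of algorithm. ComputeHash has to consume each stream to its end before anything can be compared, so both files are always read in full, even when they differ in the very first byte.

```csharp
using System;
using System.IO;
using System.Security.Cryptography;

static class HashCompare
{
    // Hypothetical helper: hashes one whole file. ComputeHash reads the
    // stream to the end, so every byte of the file is touched.
    public static byte[] HashFile(string path)
    {
        using (MD5 md5 = MD5.Create())
        using (FileStream stream = File.OpenRead(path))
        {
            return md5.ComputeHash(stream);
        }
    }

    // Two full reads plus the hashing work, with no chance of an early exit.
    public static bool SameHash(string file1, string file2)
    {
        byte[] h1 = HashFile(file1);
        byte[] h2 = HashFile(file2);
        for (int i = 0; i < h1.Length; i++)
        {
            if (h1[i] != h2[i])
            {
                return false;
            }
        }
        return true;
    }
}
```

Matching hashes also only make equality very likely, since two different files can produce the same hash.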

Klaus Jensen said:
Not sure how it would be done. Comparing the files byte by byte is
not very likely to get much faster. :(

It's going to be a lot faster than the code you posted.

Firstly, calling a method for every byte is a bad idea. Reading in
blocks is likely to make things *much* faster.

Secondly, the above doesn't do the test to start with about file sizes.

Thirdly, there's no reason to go through a BinaryReader here - it's
just an extra level of indirection to slow things down.

Here's some code in C# which should be significantly faster:

using System;
using System.IO;

class Test
{
    static bool FileCompare (string file1, string file2)
    {
        if (new FileInfo(file1).Length != new FileInfo(file2).Length)
        {
            return false;
        }

        using (FileStream s1 = new FileStream(file1, FileMode.Open),
                          s2 = new FileStream(file2, FileMode.Open))
        {
            return StreamCompare (s1, s2);
        }
    }

    const int BufferSize = 32768;

    static bool StreamCompare (Stream s1, Stream s2)
    {
        byte[] buffer1 = new byte[BufferSize];
        byte[] buffer2 = new byte[BufferSize];

        int buffer1Remaining=0;
        int buffer2Remaining=0;
        int buffer1Index=0;
        int buffer2Index=0;

        while (true)
        {
            if (buffer1Remaining==0)
            {
                buffer1Index=0;
                buffer1Remaining=s1.Read(buffer1, 0, BufferSize);
            }
            if (buffer2Remaining==0)
            {
                buffer2Index=0;
                buffer2Remaining=s2.Read(buffer2, 0, BufferSize);
            }

            // End of both streams simultaneously
            if (buffer1Remaining==0 && buffer2Remaining==0)
            {
                return true;
            }
            // One stream ended before the other
            if (buffer1Remaining==0 || buffer2Remaining==0)
            {
                return false;
            }

            int compareSize = Math.Min(buffer1Remaining, buffer2Remaining);
            for (int i=0; i < compareSize; i++)
            {
                if (buffer1[buffer1Index] != buffer2[buffer2Index])
                {
                    return false;
                }
                buffer1Index++;
                buffer2Index++;
            }

            buffer1Remaining -= compareSize;
            buffer2Remaining -= compareSize;
        }
    }
}
 
Jon Skeet said:
It's going to be a lot faster than the code you posted below:

Ok. I'll give it a shot. :)

Jon Skeet said:
Secondly, the above doesn't do the test to start with about file sizes.

In my case 9999 times out of 10000, the size would be the same. Doing the
size check would therefore not improve performance more than marginally.
That is why I initially left it out.

Jon Skeet said:
Thirdly, there's no reason to go through a BinaryReader here - it's
just an extra level of indirection to slow things down.

Here's some code in C# which should be significantly faster:
[snip]

WHAT an increase in speed! Thanks Jon!! :)

I did a test with two identical 91 KB files. I compared using the function I
posted earlier:

Average runtime per compare: 0.35sec

And then using your function:

Average runtime per compare: 0.0009sec [woooooooot]

Thanks a lot! :)
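
The thread does not show how these averages were measured. One plausible way, sketched here under that caveat, is to time many repeated calls and divide by the number of runs. The CompareTimer class, the AverageSeconds method, and the run count are all hypothetical; the compare parameter stands in for whichever comparison function is being tested.

```csharp
using System;
using System.Diagnostics;

static class CompareTimer
{
    // Runs the comparison 'runs' times and returns the mean wall-clock
    // time per call in seconds.
    public static double AverageSeconds(Func<string, string, bool> compare,
                                        string file1, string file2, int runs)
    {
        Stopwatch watch = Stopwatch.StartNew();
        for (int i = 0; i < runs; i++)
        {
            compare(file1, file2);
        }
        watch.Stop();
        return watch.Elapsed.TotalSeconds / runs;
    }
}
```

Repeated runs over the same two files are typically served from the operating system's file cache, so figures obtained this way mostly reflect CPU and call overhead rather than disk speed.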

I included the vb-code below if anyone else is interested.

Function CompareFiles(ByVal file1 As String, ByVal file2 As String) As Boolean
    Const BufferSize As Integer = 32768
    Dim fileStream1 As New System.IO.FileStream(file1, IO.FileMode.Open)
    Dim fileStream2 As New System.IO.FileStream(file2, IO.FileMode.Open)
    Dim buffer1() As Byte
    Dim buffer2() As Byte
    buffer1 = New Byte(BufferSize) {}
    buffer2 = New Byte(BufferSize) {}
    Dim buffer1Remaining As Integer = 0
    Dim buffer2Remaining As Integer = 0
    Dim buffer1Index As Integer = 0
    Dim buffer2Index As Integer = 0
    While True
        If buffer1Remaining = 0 Then
            buffer1Index = 0
            buffer1Remaining = fileStream1.Read(buffer1, 0, BufferSize)
        End If
        If buffer2Remaining = 0 Then
            buffer2Index = 0
            buffer2Remaining = fileStream1.Read(buffer2, 0, BufferSize)
        End If
        If buffer1Remaining = 0 And buffer2Remaining = 0 Then
            fileStream1.Close()
            fileStream2.Close()
            Return True
        End If
        If buffer1Remaining = 0 Or buffer2Remaining = 0 Then
            fileStream1.Close()
            fileStream2.Close()
            Return False
        End If
        Dim compareSize As Integer = Math.Min(buffer1Remaining, buffer2Remaining)
        For i As Integer = 0 To compareSize
            i += 1
            If (buffer1(buffer1Index) <> buffer2(buffer2Index)) Then
                fileStream1.Close()
                fileStream2.Close()
                Return False
            End If
            buffer1Index += 1
            buffer1Index += 2
        Next
        fileStream1.Close()
        fileStream2.Close()
    End While
End Function
 
Klaus Jensen said:
WHAT an increase in speed! Thanks Jon!! :)

No problem :)

Klaus Jensen said:
I included the vb-code below if anyone else is interested.

You should really use a Finally block for the filestream closing - that
way you don't need to pepper your code with calls to Close, and the
streams get closed even if an error occurs.
 
There were several errors in the VB code I posted earlier. I have corrected
them and rerun the test. Each compare of the 91 KB file now takes a bit
longer, at 0.0017 sec per compare. Still pretty amazing, thanks Jon! :)

Function CompareFiles(ByVal file1 As String, ByVal file2 As String) As Boolean
    Const BufferSize As Integer = 32768
    Dim fileStream1 As System.IO.FileStream = Nothing
    Dim fileStream2 As System.IO.FileStream = Nothing
    Try
        fileStream1 = New System.IO.FileStream(file1, IO.FileMode.Open)
        fileStream2 = New System.IO.FileStream(file2, IO.FileMode.Open)
        ' VB array bounds are inclusive: an upper bound of BufferSize - 1 gives BufferSize bytes
        Dim buffer1() As Byte = New Byte(BufferSize - 1) {}
        Dim buffer2() As Byte = New Byte(BufferSize - 1) {}
        Dim buffer1Remaining As Integer = 0
        Dim buffer2Remaining As Integer = 0
        Dim buffer1Index As Integer = 0
        Dim buffer2Index As Integer = 0
        While True
            ' Refill a buffer once all of its bytes have been compared
            If buffer1Remaining = 0 Then
                buffer1Index = 0
                buffer1Remaining = fileStream1.Read(buffer1, 0, BufferSize)
            End If
            If buffer2Remaining = 0 Then
                buffer2Index = 0
                buffer2Remaining = fileStream2.Read(buffer2, 0, BufferSize)
            End If
            ' Both streams ended at the same time: the files are identical
            If buffer1Remaining = 0 AndAlso buffer2Remaining = 0 Then
                Return True
            End If
            ' One stream ended before the other: the files differ
            If buffer1Remaining = 0 OrElse buffer2Remaining = 0 Then
                Return False
            End If
            Dim compareSize As Integer = Math.Min(buffer1Remaining, buffer2Remaining)
            For i As Integer = 0 To compareSize - 1
                If buffer1(buffer1Index) <> buffer2(buffer2Index) Then
                    Return False
                End If
                buffer1Index += 1
                buffer2Index += 1
            Next
            buffer1Remaining -= compareSize
            buffer2Remaining -= compareSize
        End While
    Finally
        ' Close whichever streams were opened, even if an error occurred
        If Not fileStream1 Is Nothing Then fileStream1.Close()
        If Not fileStream2 Is Nothing Then fileStream2.Close()
    End Try
End Function
 
Tim said:
It's a shame you can't get at the CRC value; it would be a useful shortcut to test inequality.

The key issue is when would the CRC get calculated in the first place?

Since calculating a CRC would require running through the whole file,
it's clearly better to compare the files until you find a difference.
Only in the worst case (the files are the same) do you have to read
through the whole file. If you're using CRCs, you always have to read
through the whole file.

Now, if you can get the CRC because it's already been calculated and
cached somewhere, that might change the costs.

However, the file system does not keep CRCs lying around, and I don't
think people would want all file changes to require that the system
recalculate CRCs.

--Tim
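
The cached case mentioned above could be sketched roughly as follows. This is an assumption-laden illustration rather than anything posted in the thread: the HashCache class is hypothetical, SHA-256 is an arbitrary choice, and the cache key combines path, length, and last-write time on the assumption that a file whose size and timestamp are unchanged has not been modified. Each file is read once, and later comparisons involving the same file reuse the stored hash.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Security.Cryptography;

class HashCache
{
    // Maps "path|length|timestamp" to a Base64-encoded hash.
    private readonly Dictionary<string, string> cache =
        new Dictionary<string, string>();

    // Number of files actually read and hashed (cache misses).
    public int FilesHashed { get; private set; }

    public string GetHash(string path)
    {
        FileInfo info = new FileInfo(path);
        string key = info.FullName + "|" + info.Length + "|"
                     + info.LastWriteTimeUtc.Ticks;
        string hash;
        if (!cache.TryGetValue(key, out hash))
        {
            // Cache miss: this is the only place the file is read in full.
            using (SHA256 sha = SHA256.Create())
            using (FileStream stream = File.OpenRead(path))
            {
                hash = Convert.ToBase64String(sha.ComputeHash(stream));
            }
            cache[key] = hash;
            FilesHashed++;
        }
        return hash;
    }

    // Different hashes prove the files differ; equal hashes only make
    // equality very likely.
    public bool ProbablyEqual(string file1, string file2)
    {
        return GetHash(file1) == GetHash(file2);
    }
}
```

This only pays off when the same file takes part in many comparisons. For a one-off comparison of two files, the early-exit byte comparison still reads less data.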
