Alberto said:
[...]
Loop on the file on disk reading all the rows one by one. For each row
do the following:
- Compute a hash value on the fields that you want to be unique (the
GetHashCode method should serve you well).
- Use the hash code to store in a hashtable the offset in the file where
the record that you read initially was located. This hashtable will
probably be small enough to fit into your available RAM, since it only
stores a numeric value and not your whole records.
- For each record, first look it up in the hash table. If it exists
there, Seek to the offset in the file where it is stored, read the data
and compare them to the record you are currently processing. If they
match, then the current record is a duplicate and it can be discarded.
Otherwise, continue comparing the rest of the records that have the same
hash code, since there could be more than one. After you have compared
them all, if none match, save the record hash and offset to the internal
table and save the complete record into an output file.
IMHO, this suggestion is one of the best offered so far. It's
efficient, reasonably simple to implement, and offloads _all_ of the
actual key data to the file system (minus the 4- or 8-byte offset into
the file, of course). It does add the cost of a Dictionary (preferable)
or Hashtable (key and value), as compared to a HashSet (key only), so it
will work best with relatively long keys.
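Something along these lines, as a rough sketch of that offset-based
approach (my own illustration, not Alberto's code; it assumes
newline-delimited UTF-8 rows and a getKey delegate that extracts the key
columns from a row):

using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

class OffsetDedup
{
    // Sketch only: de-duplicate by hashing the key columns, keeping just
    // the file offsets of previously-kept rows, and seeking back into the
    // input file to resolve hash collisions against the actual data.
    static void Deduplicate(string inputPath, string outputPath,
        Func<string, string> getKey)
    {
        // hash of key -> offsets (in the input file) of rows already kept
        Dictionary<int, List<long>> offsetsByHash = new Dictionary<int, List<long>>();

        using (FileStream input = File.OpenRead(inputPath))
        using (FileStream probe = File.OpenRead(inputPath))
        using (StreamWriter writer = new StreamWriter(outputPath))
        {
            long offset = 0;
            int consumed;
            string row;
            while ((row = ReadRow(input, out consumed)) != null)
            {
                string key = getKey(row);
                int hash = key.GetHashCode();

                bool duplicate = false;
                List<long> offsets;
                if (offsetsByHash.TryGetValue(hash, out offsets))
                {
                    // Different keys can share a hash code, so compare the
                    // actual key data for every row stored under this hash.
                    foreach (long earlier in offsets)
                    {
                        int ignored;
                        probe.Seek(earlier, SeekOrigin.Begin);
                        if (getKey(ReadRow(probe, out ignored)) == key)
                        {
                            duplicate = true;
                            break;
                        }
                    }
                }

                if (!duplicate)
                {
                    if (offsets == null)
                        offsetsByHash[hash] = offsets = new List<long>();
                    offsets.Add(offset);
                    writer.WriteLine(row);
                }

                offset += consumed;
            }
        }
    }

    // Reads one '\n'-terminated row from the stream's current position and
    // reports how many bytes it consumed, so the caller can track offsets.
    static string ReadRow(Stream stream, out int consumed)
    {
        List<byte> bytes = new List<byte>();
        consumed = 0;
        int b;
        while ((b = stream.ReadByte()) != -1)
        {
            consumed++;
            if (b == '\n')
                break;
            bytes.Add((byte)b);
        }
        if (consumed == 0)
            return null;
        return Encoding.UTF8.GetString(bytes.ToArray()).TrimEnd('\r');
    }
}

(Byte-at-a-time reads are slow, of course; a real version would buffer,
but that's just plumbing.)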
The sorting-based suggestions aren't bad either, but of course they get
complicated if the output needs to remain in the same order as the
input: you have to add a new sort-order field to the intermediate files,
and then re-sort the output once the duplicates have been removed.
Depending on how much the deduplication shrinks the input, that could
mean yet again partitioning the problem to sort segments of the output
and then doing a merge sort to reassemble everything.
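Just to show the bookkeeping for that order-preserving variant, here's a
sketch of my own over an in-memory list; for a 4GB file each Sort call
below would itself have to become an external (chunk-and-merge) sort,
but the sequence-number trick is the same:

// Tag rows with their original position, sort by key to bring duplicates
// together, keep the first occurrence of each key, then re-sort by the
// sequence number to restore the original input order.
static List<string> DeduplicatePreservingOrder(List<string> rows,
    Func<string, string> getKey)
{
    var tagged = new List<KeyValuePair<long, string>>(rows.Count);
    for (int i = 0; i < rows.Count; i++)
        tagged.Add(new KeyValuePair<long, string>(i, rows[i]));

    // Sort by key, breaking ties on the sequence number so the earliest
    // occurrence of each key comes first in its run.
    tagged.Sort((a, b) =>
    {
        int byKey = string.CompareOrdinal(getKey(a.Value), getKey(b.Value));
        return byKey != 0 ? byKey : a.Key.CompareTo(b.Key);
    });

    var kept = new List<KeyValuePair<long, string>>();
    foreach (var item in tagged)
    {
        if (kept.Count == 0 ||
            getKey(kept[kept.Count - 1].Value) != getKey(item.Value))
            kept.Add(item);
    }

    // Restore the original input order using the sequence numbers.
    kept.Sort((a, b) => a.Key.CompareTo(b.Key));

    var result = new List<string>(kept.Count);
    foreach (var item in kept)
        result.Add(item.Value);
    return result;
}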
I don't usually suggest using a database for random file manipulation
questions, but I think in this case that is also a good suggestion, just
due to the sheer simplicity of it. Access and SQL Express aren't the
only databases out there, and as "CY" points out, MySQL can be
configured to support larger databases, large enough to handle your input.
Of course, the easiest solution would be to switch to a 64-bit OS. Then
you can allocate larger amounts of memory in your process, and the error
would go away.
:)
(With only 2GB of RAM on the computer, you'll
certainly run into disk swapping performance issues, but all the
fallback solutions will rely on some kind of disk i/o anyway, so
performance is going to suffer one way or the other using those techniques).
Another option you might try is to store your column values more
efficiently. Rather than using a HashSet<string>, use a HashSet<byte[]>
and store the raw bytes from the file rather than the UTF-16 of a
System.String instance (or re-encode them as ASCII or UTF-8, if that
works better for you). If you know that the keys contain only
characters from a limited set, you could even
compress the bytes further, either just by reducing the
bits-per-character (treat the byte array as a bit-stream), or by using
something like Huffman encoding (simple to implement and works well on
ASCII text).
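For example (a sketch; ByteArrayComparer is my own name, and note that
HashSet<byte[]> needs a custom comparer, since arrays only get reference
equality by default):

using System;
using System.Collections.Generic;
using System.Text;

// Compares the key bytes by value rather than by array reference.
class ByteArrayComparer : IEqualityComparer<byte[]>
{
    public bool Equals(byte[] x, byte[] y)
    {
        if (ReferenceEquals(x, y)) return true;
        if (x == null || y == null || x.Length != y.Length) return false;
        for (int i = 0; i < x.Length; i++)
            if (x[i] != y[i]) return false;
        return true;
    }

    public int GetHashCode(byte[] data)
    {
        // FNV-1a over the raw bytes; any reasonable byte hash will do.
        unchecked
        {
            int hash = (int)2166136261;
            foreach (byte b in data)
                hash = (hash ^ b) * 16777619;
            return hash;
        }
    }
}

class Example
{
    static void Main()
    {
        var seen = new HashSet<byte[]>(new ByteArrayComparer());

        // Re-encode each key as UTF-8 (roughly half the size of the UTF-16
        // inside a System.String for ASCII-heavy data) before adding it.
        string[] keys = { "alpha,42", "beta,7", "alpha,42" };
        foreach (string keyString in keys)
        {
            byte[] key = Encoding.UTF8.GetBytes(keyString);
            bool isNew = seen.Add(key);   // false => duplicate key, skip the row
            Console.WriteLine("{0} -> {1}", keyString, isNew ? "kept" : "duplicate");
        }
    }
}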
In fact, depending on the data requirements for a file offset (probably
8 bytes if your file is 4GB or larger, though you could get complicated
and store fewer than 8 bytes, throwing out the most significant bytes
that you know are always 0 for the size of the file you're dealing
with), and the actual length of a key being hashed, this approach could
work as well as or better than the "hash, storing offsets to file" approach.
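If you did want to go that far, packing the offset down is trivial;
e.g., assuming the file is known to be smaller than 1 TB, 5 bytes are
enough (a sketch):

// Store a 64-bit file offset in 5 bytes (enough for files up to 2^40 bytes,
// i.e. 1 TB), dropping the high-order bytes that are known to be zero.
static byte[] PackOffset(long offset)
{
    byte[] packed = new byte[5];
    for (int i = 0; i < packed.Length; i++)
        packed[i] = (byte)(offset >> (8 * i));   // little-endian, low 5 bytes
    return packed;
}

static long UnpackOffset(byte[] packed)
{
    long offset = 0;
    for (int i = 0; i < packed.Length; i++)
        offset |= (long)packed[i] << (8 * i);
    return offset;
}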
Finally, you might try writing your own hash data structure, and
including the ability to set in advance the exact size of the hash
table, and maybe even pre-allocate a pool of entry data structures for
the table.
Setting the size of the table in advance could be very helpful, because
at least part of your error is likely to be caused by the fact that
HashSet has to reallocate its internal data structures as it grows, and
each time that happens, you have two copies of the data in memory at
once. The reallocation also _will_ result in fragmentation of the
large-object heap, which can result in an out-of-memory error even when
there is theoretically space in the process's virtual address space to
allocate more memory.
Pre-allocating a pool of entry data structures as an array, from which
you pull indices rather than references to actual objects, allows the
pool itself to be created only once (or at least far fewer times, if you
decide, for example, to allocate arrays of 1000 entries at a time),
reducing overhead in the memory manager.
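Just to illustrate the shape I mean (a sketch only, sized once up front,
with collision chains linked by index into parallel arrays instead of by
object references):

using System;

class FixedSizeHashSet
{
    private readonly int[] buckets;     // bucket -> index of first entry, or -1
    private readonly int[] nextIndex;   // entry -> index of next entry in its chain, or -1
    private readonly byte[][] keys;     // entry -> the key bytes themselves
    private int count;

    public FixedSizeHashSet(int capacity)
    {
        // Everything is allocated once, so there is never a grow-and-copy
        // step and no transient doubling of memory.
        buckets = new int[capacity];
        nextIndex = new int[capacity];
        keys = new byte[capacity][];
        for (int i = 0; i < capacity; i++)
            buckets[i] = -1;
    }

    // Returns true if the key was added, false if it was already in the set.
    public bool Add(byte[] key)
    {
        int bucket = (Hash(key) & 0x7fffffff) % buckets.Length;

        // Walk the chain of entries that landed in this bucket.
        for (int i = buckets[bucket]; i != -1; i = nextIndex[i])
            if (SameKey(keys[i], key))
                return false;

        if (count == keys.Length)
            throw new InvalidOperationException("Capacity exceeded; size it larger up front.");

        keys[count] = key;
        nextIndex[count] = buckets[bucket];
        buckets[bucket] = count;
        count++;
        return true;
    }

    private static int Hash(byte[] key)
    {
        unchecked
        {
            int h = (int)2166136261;    // FNV-1a
            foreach (byte b in key)
                h = (h ^ b) * 16777619;
            return h;
        }
    }

    private static bool SameKey(byte[] a, byte[] b)
    {
        if (a.Length != b.Length)
            return false;
        for (int i = 0; i < a.Length; i++)
            if (a[i] != b[i])
                return false;
        return true;
    }
}

The keys here are still individual byte[] allocations; to go further you
could pack them into one big byte[] pool and store offsets and lengths,
but the sketch shows the main idea.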
This last suggestion is not necessarily as pointless as it might seem.
Depending on the nature of each row of data in your CSV file, it stands
to reason that the _actual_ memory cost of your hashing should be quite
a bit less than the 4GB the file requires, on the assumption that you're
checking only a small subset of the entire row (two columns out of how
many, of what kind of data?).
So, sure...you're definitely pushing the limits of what can be done in
the 2GB of virtual address space allotted to a 32-bit process (3GB if
you boot with /3GB and set the /LARGEADDRESSAWARE for your
application...I don't recall if this is set by default for managed
apps). But assuming the columns that are used to determine duplication
are less than, say, 1/6th or 1/8th of each row of data (1/8th of a 4GB
file is only 512MB of key data), then it seems likely to me that a 4GB
file could be processed even in a 32-bit app, without resorting to
using the file system as temporary storage (other than implicit disk
swapping, of course). "All" you need is a
more efficient implementation of the data structures.
If performance is very important, it could well be worth the extra
effort to approach the problem that way, rather than pushing the problem
out to the file system (which is admittedly simpler in some ways).
Note that the various proposals fall into two broad categories:
-- those that will still fail for slightly larger files
-- those that scale well to significantly larger files
All of the memory-based solutions, while they may address your current
needs, can still be pushed to the breaking point for files even larger
than what you're dealing with. The database and file-system-only
solutions are limited only by the file system itself, and so are likely
to continue working on MUCH larger inputs, albeit at perhaps a larger
performance cost than the memory-based solutions.
In the end, it may well be that just adding more RAM and a 64-bit OS is
the best solution.
Pete