Modifying a text file

soup_nazi

I want to remove duplicate entries within a text file. So if I had
this within a text file...

Applications/Diabetic Registry/
Applications/Diabetic Registry/
Applications/Diabetic Registry/
Applications/Great Plains/
Applications/Great Plains/
Applications/Great Plains/
Applications/Great Plains/Servers/
Applications/Great Plains/Servers/
Applications/HeartBase/
Applications/HeartBase/
Applications/HeartBase/
Applications/HHC/
Applications/HHC/
Applications/HHC/
Applications/HHC/

I would want the result to be this:

Applications/Diabetic Registry/
Applications/Great Plains/
Applications/Great Plains/Servers/
Applications/HeartBase/
Applications/HHC/

I've tried using StreamReader and StreamWriter simultaneously with no
success...any other ideas?
 
Use the StreamReader to read the lines into an array of strings. Close the
StreamReader. Loop through the array to eliminate the duplicates, comparing
each string in the array with all of the strings before it and setting each
duplicate entry to a blank string. Then write the array back to the file
using a StreamWriter. Don't write the blank array members.

If your file contains blank lines, use a different string to indicate a
removed string (e.g. "[REMOVED]").
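
In outline, a minimal C# sketch of that approach; the file names are
placeholders, and "[REMOVED]" is the sentinel suggested above:

using System.Collections.Generic;
using System.IO;

class DedupeFile
{
    static void Main()
    {
        // Read every line into an array, then close the reader.
        string[] lines;
        using (StreamReader reader = new StreamReader("input.txt"))
        {
            List<string> buffer = new List<string>();
            string line;
            while ((line = reader.ReadLine()) != null)
                buffer.Add(line);
            lines = buffer.ToArray();
        }

        // Compare each line with all of the lines before it and mark
        // duplicates with the sentinel. The first occurrence is never marked.
        const string Removed = "[REMOVED]";
        for (int i = 1; i < lines.Length; i++)
            for (int j = 0; j < i; j++)
                if (lines[i] == lines[j])
                {
                    lines[i] = Removed;
                    break;
                }

        // Write everything back except the marked entries.
        using (StreamWriter writer = new StreamWriter("output.txt"))
            foreach (string s in lines)
                if (s != Removed)
                    writer.WriteLine(s);
    }
}

Writing to a separate file and replacing the original afterwards avoids
reading and writing the same file at once.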

--
HTH,

Kevin Spencer
Microsoft MVP
.Net Developer
Who is Mighty Abbott?
A twin turret scalawag.
 
If the file is large, this might be a drain on resources and cause
performance problems.

 
Question: will the duplicate entries always be next to each other?

Can you provide some code that shows how you used the reader and writer?
There just might be something wrong with your logic.
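
If they are always adjacent, as in the sample, a single pass comparing each
line to the previous one is enough; there is no need to hold the whole file
in memory. A minimal sketch, with placeholder file names:

using System.IO;

class DedupeAdjacent
{
    static void Main()
    {
        // Assumes duplicates are consecutive. Reading and writing the same
        // file at once is the usual pitfall, so write to a separate file.
        using (StreamReader reader = new StreamReader("input.txt"))
        using (StreamWriter writer = new StreamWriter("output.txt"))
        {
            string previous = null;
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                if (line != previous)
                    writer.WriteLine(line);
                previous = line;
            }
        }
    }
}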
 
Peter Rilling said:
If the file is large, this might be a drain on resources and cause
performance problems.

If the file is *very* large, perhaps. However, I have written applications
that load hundreds of MB of data into memory without any performance issues.
Judging by the sample he posted, the file is not likely to be very large.

Other solutions that would handle very large files and check for duplicate
lines would definitely slow down performance. Disk IO is costly and slow,
especially in a managed app. When possible, it's best to read an entire file
into memory and work with it from there.

Yes, it would be possible to open a stream to the file, and read a line (or
a chunk of lines) at a time, comparing each line to another line (or chunk
of lines) read from the stream. If it were a very large file, this might be
necessary. But again, it would be costly to do so, because of the constant
disk IO involved. In addition, the constant reallocation of strings would
consume a lot of managed memory. You'll notice that my solution did not
involve any reallocation of strings, except for the blank strings used to
replace removed strings.

Yes, my solution could be optimized a bit more. For example, rather than
replacing a string with a blank string in the array, removed strings could
be replaced with null, now that I think of it.
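
A sketch of that null-marking variant; the paths are placeholders, and
File.ReadAllLines is just a convenience that does the read-and-close in one
call:

using System.IO;

class DedupeNullMarking
{
    static void Main()
    {
        // Read the whole file into memory at once.
        string[] lines = File.ReadAllLines("input.txt");

        // Mark duplicates with null; unlike a sentinel string, null can
        // never collide with real data, blank lines included.
        for (int i = 1; i < lines.Length; i++)
            for (int j = 0; j < i; j++)
                if (lines[j] != null && lines[i] == lines[j])
                {
                    lines[i] = null;
                    break;
                }

        // Skip the null entries when writing.
        using (StreamWriter writer = new StreamWriter("output.txt"))
            foreach (string s in lines)
                if (s != null)
                    writer.WriteLine(s);
    }
}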

If you have a better idea, let's hear it.

--
HTH,

Kevin Spencer
Microsoft MVP
.Net Developer
Who is Mighty Abbott?
A twin turret scalawag.

 

The usual way to remove duplicates is to load the file into memory, sort
it, then run through it keeping any line that does not match the previous
line.
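
A minimal C# sketch of that approach, with placeholder file names; note
that the output comes out in sorted order, which may differ from the
original order:

using System;
using System.IO;

class SortAndDedupe
{
    static void Main()
    {
        // Sorting brings all duplicates together, so a single pass
        // comparing against the previous line removes them all.
        string[] lines = File.ReadAllLines("input.txt");
        Array.Sort(lines, StringComparer.Ordinal);

        using (StreamWriter writer = new StreamWriter("output.txt"))
        {
            string previous = null;
            foreach (string line in lines)
            {
                if (line != previous)
                    writer.WriteLine(line);
                previous = line;
            }
        }
    }
}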

If the file is too big to load into memory in one piece, then you will
have to look at other techniques. Either process the file in chunks (read
up on "merge sort" for ideas), or else use the structure inherent in the
example you showed: you could load the whole thing into a tree, reducing
the amount of memory used (a sketch follows the diagram below):
<ASCII art ahead - monospaced font strongly recommended>

Applications -+-> Diabetic Registry ---> end
              |
              +-> Great Plains -+-> end
              |                 |
              |                 +-> Servers ---> end
              |
              +-> HeartBase ---> end
              |
              +-> HHC ---> end
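
A minimal C# sketch of that tree idea, assuming '/' as the separator and
placeholder file names; each node keys its children by one path segment, so
shared prefixes are stored only once:

using System;
using System.Collections.Generic;
using System.IO;

class PathTrie
{
    readonly SortedDictionary<string, PathTrie> children =
        new SortedDictionary<string, PathTrie>(StringComparer.Ordinal);
    bool terminal; // true if an input line ends at this node

    public void Add(string line)
    {
        PathTrie node = this;
        foreach (string segment in line.Split('/'))
        {
            PathTrie child;
            if (!node.children.TryGetValue(segment, out child))
                node.children[segment] = child = new PathTrie();
            node = child;
        }
        node.terminal = true; // adding the same line again is a no-op
    }

    public void Write(TextWriter writer, string prefix)
    {
        if (terminal)
            writer.WriteLine(prefix);
        foreach (KeyValuePair<string, PathTrie> kv in children)
            kv.Value.Write(writer,
                prefix.Length == 0 ? kv.Key : prefix + "/" + kv.Key);
    }
}

class Program
{
    static void Main()
    {
        PathTrie root = new PathTrie();
        // File.ReadLines streams the file line by line, so only the
        // (prefix-compressed) tree has to fit in memory.
        foreach (string line in File.ReadLines("input.txt"))
            root.Add(line);
        using (StreamWriter writer = new StreamWriter("output.txt"))
            root.Write(writer, "");
    }
}

Duplicate lines simply land on a node that already exists, and the output
comes out sorted segment by segment rather than in the original order.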

rossum
 