Array Problem When Index Value is Nothing

Robert Bevington

Hi all,

I ran into memory problems while trying to search and replace in a very large
text file. To solve this I break the file up into chunks and run the search
and replace on each chunk. This works fine and has solved the OutOfMemory
problem.

However, on the last loop, when the array c is written to CleanTMX, a number
of 0x00 characters are written at the end of the file. This causes problems
in a later XML transformation, as this character is not allowed in XML. I
looked at the values in the array. The problem seems to be caused by the
elements at the end of the array being set to Nothing.

Question: How can I get rid of these characters? Or how can I reduce the
array to only contain index values that are not Nothing?

Here's the code that writes the CleanTMX file:

'ReadChunkSize is a user-defined setting, normally set to 10000
Dim c(My.Settings.ReadChunkSize) As Char

Using sr As StreamReader = New StreamReader(OriginalTMX, System.Text.Encoding.UTF8, True)
    Do While sr.Peek() >= 0
        sr.Read(c, 0, c.Length)
        Dim i As Integer
        For i = 0 To arrFind.Length - 1
            c = Regex.Replace(c, arrFind(i), arrReplace(i))
        Next
        Try
            Using sw As StreamWriter = New StreamWriter(CleanTMX, True, System.Text.Encoding.UTF8)
                sw.Write(c)
            End Using

        Catch ex As Exception
        End Try
    Loop
End Using

Would really appreciate any help on this one.

Thanx

Rob
 
I'm not sure if it's correct in this context, but I think sr.Read
returns the number of characters read. Hence, you have to write only as
many characters as have been read.

Dim charCount As Integer

charCount = sr.Read(c, 0, c.Length)
...
sw.Write(c, 0, charCount)

I think this explains the additional characters.
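
One detail worth noting: Regex.Replace can change the length of the text, so the count returned by Read only describes the buffer before the replacements run. A minimal sketch of one way around that, assuming the replacements are applied to a String built from only the characters that were read. The names ChunkCleaner, CleanChunk and CleanFile are hypothetical, and this sketch does not deal with matches that span two chunks.

```vb
Imports System.IO
Imports System.Text
Imports System.Text.RegularExpressions

Module ChunkCleaner
    ' Hypothetical helper: applies every find/replace pair to the part of the
    ' buffer that the last Read call actually filled.
    Function CleanChunk(buffer As Char(), charCount As Integer,
                        arrFind As String(), arrReplace As String()) As String
        ' Copy only the characters that were read; unused slots are left out.
        Dim chunkText As New String(buffer, 0, charCount)
        For i As Integer = 0 To arrFind.Length - 1
            chunkText = Regex.Replace(chunkText, arrFind(i), arrReplace(i))
        Next
        Return chunkText
    End Function

    ' Hypothetical driver: reads the source in chunks and writes the cleaned text.
    Sub CleanFile(originalPath As String, cleanPath As String, chunkSize As Integer,
                  arrFind As String(), arrReplace As String())
        Dim buffer(chunkSize - 1) As Char
        Using sr As New StreamReader(originalPath, Encoding.UTF8, True),
              sw As New StreamWriter(cleanPath, False, Encoding.UTF8)
            Dim charCount As Integer = sr.Read(buffer, 0, buffer.Length)
            ' Read returns 0 once the end of the file has been reached.
            Do While charCount > 0
                sw.Write(CleanChunk(buffer, charCount, arrFind, arrReplace))
                charCount = sr.Read(buffer, 0, buffer.Length)
            Loop
        End Using
    End Sub
End Module
```

The writer is opened once for the whole run here instead of being reopened in append mode for every chunk; that is a design choice of the sketch, not something taken from the original code.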

However, you should reposition the file pointer after reading a chunk.
I'm not sure if that's possible using the StreamReader because of the
internal buffer, so you'd have to use a BinaryReader and do the UTF8
decoding on your own, while being able to set the file pointer
backwards. Otherwise, you will not recognize search strings that are
split across chunk boundaries. For example,

chunk #1: "Robert B"
chunk #2: "evington"

You won't find "Bev" in either of the chunks.


Armin
 
Can't you just ReDim Preserve to reduce the array size to get rid of the 0x00
entries?
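
A small sketch of what that could look like, assuming the count returned by sr.Read is kept in a variable (called charCount here). The module and function names are hypothetical.

```vb
Module BufferTrim
    ' Hypothetical helper: shrinks the buffer to the number of characters read.
    Function TrimToRead(buffer As Char(), charCount As Integer) As Char()
        ' ReDim takes the new upper bound, so the resulting length is charCount.
        ReDim Preserve buffer(charCount - 1)
        Return buffer
    End Function
End Module
```

Since ReDim Preserve allocates a new array, a shrunken buffer stays small for any later reads unless it is resized again, so this fits best on the final chunk.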

Armin is correct that you'll miss entries on chunk boundaries, BTW. One
solution is to use the 'c' array as a buffer, appending newly read characters
to the end, taking off characters to the output stream from the beginning,
and always leaving at least n characters in 'c', where n=length of the
biggest string you are looking for (minus one).
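
The carry-over idea can be sketched roughly as follows. This is a simplified illustration, not a drop-in fix: it assumes a single literal search string (no regular expressions, whose match length is not generally known in advance), and the names StreamingReplace and ReplaceStreaming are hypothetical.

```vb
Imports System
Imports System.IO

Module StreamingReplace
    ' Hypothetical sketch: replaces a literal string while reading in chunks,
    ' holding back the last (find.Length - 1) characters of each chunk so that
    ' a match split across a chunk boundary is still seen.
    Sub ReplaceStreaming(reader As TextReader, writer As TextWriter,
                         find As String, replacement As String, chunkSize As Integer)
        If String.IsNullOrEmpty(find) Then Throw New ArgumentException("find must not be empty")
        Dim buffer(chunkSize - 1) As Char
        Dim carry As String = ""
        Do
            Dim charCount As Integer = reader.Read(buffer, 0, buffer.Length)
            Dim atEnd As Boolean = (charCount = 0)
            ' Held-back tail of the previous chunk plus the newly read characters.
            Dim pending As String = carry & New String(buffer, 0, charCount)
            Dim pos As Integer = 0
            Do
                Dim idx As Integer = pending.IndexOf(find, pos, StringComparison.Ordinal)
                If idx < 0 Then Exit Do
                writer.Write(pending.Substring(pos, idx - pos))
                writer.Write(replacement)
                pos = idx + find.Length
            Loop
            ' At the end everything is flushed; otherwise the tail that could
            ' still be the start of a match is kept for the next round.
            Dim keepFrom As Integer = pending.Length
            If Not atEnd Then keepFrom = Math.Max(pos, pending.Length - (find.Length - 1))
            writer.Write(pending.Substring(pos, keepFrom - pos))
            carry = pending.Substring(keepFrom)
            If atEnd Then Exit Do
        Loop
    End Sub
End Module
```

With the example from earlier in the thread, reading "Robert Bevington" in chunks of 8 characters and replacing "Bev" holds back " B" after the first chunk, so the match is found once "evington" arrives.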
 
Hi Armin and Surtur,

thanx guys for your replies. Having read that my "great" solution to my
problem didn't really work was a real downer for me :-) I was a broken man
last night and went straight to bed :-) But that's what happens when
beginners start programming I suppose.

I tried the Redim Preserve. That might solve the one problem. I just need to
find the correct value for the redim.

Surtur's solution sounds interesting too. I'll look into both.

Again thanx

Rob
 