Variable String Limit

  • Thread starter: Freddy Coal

Freddy Coal

Hi, I have a strange error.

I have a 200 MB text file. For speed, I am trying to load it into memory (my
PC has 3 GB of RAM and runs 32-bit Windows XP). When I try to load it into a
String variable, I get the following error:

Exception of type 'System.OutOfMemoryException' was thrown.

My question: if the limit for a String is 2^31 (2,147,483,647, roughly 2 GB),
why do I get this error when loading a file far smaller than the theoretical
limit of a 32-bit operating system?

To read the file I'm using the following code:

Dim Lector As New StreamReader(Path_File, True)
Dim Cadena As String

Cadena = Lector.ReadToEnd()

At the moment I read the file line by line, but this is very slow, because I
have to process each line separately instead of the whole block (for example,
to remove quotes or extract data with Regex). If I could load the whole file,
I could apply a regex to do several processing steps at once.

Thanks in advance for any help.

Freddy Coal
 
Freddy said:
Dim Lector As New StreamReader(Path_File, True)
Dim Cadena as string

Cadena = Lector.ReadToEnd

I get the same error with the constructor overload you use. If I specify
the encoding (Unicode), there's no problem. Which character encoding is used
in the file? Could you also specify the correct encoding?


Armin
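
One way to answer the "which encoding?" question is to let StreamReader look at the byte order mark and report what it found. The following is only a sketch: the module name, the helper DetectedEncodingName and the temporary file are made up for illustration, and the Encoding.Default fallback is an assumption about what the file would use when it has no BOM.

```vbnet
Imports System
Imports System.IO
Imports System.Text

Module EncodingSketch
    ' Returns the name of the encoding StreamReader settles on for a file.
    ' BOM detection only happens once the reader has looked at the first bytes.
    Function DetectedEncodingName(ByVal fileName As String) As String
        ' Assumption: fall back to the system default encoding when no BOM is present.
        Using reader As New StreamReader(fileName, Encoding.Default, True)
            reader.Peek() ' forces the reader to fill its buffer and inspect the BOM
            Return reader.CurrentEncoding.WebName
        End Using
    End Function

    Sub Main()
        ' Hypothetical test file, written as UTF-16 with a byte order mark.
        Dim fileName As String = Path.GetTempFileName()
        File.WriteAllText(fileName, "-95.50", Encoding.Unicode)

        Console.WriteLine("Detected: " & DetectedEncodingName(fileName))

        ' Once the encoding is known, pass it explicitly instead of guessing.
        Using reader As New StreamReader(fileName, Encoding.Unicode)
            Console.WriteLine(reader.ReadToEnd())
        End Using

        File.Delete(fileName)
    End Sub
End Module
```

If the file has no byte order mark, the reader simply keeps the fallback encoding, so the result is only as good as that guess.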
 
Freddy Coal said:
Hi, I have a strange error.

I have a 200 MB text file. For speed, I am trying to load it into memory (my
PC has 3 GB of RAM and runs 32-bit Windows XP). When I try to load it into a
String variable, I get the following error:

Exception of type 'System.OutOfMemoryException' was thrown.

The 200 MB could become 400 MB in memory if the source is ANSI, because a
.NET String stores each character as two bytes. Also, the out-of-memory error
could occur because there is no contiguous block of free memory of that size.
You may have 1 GB free, but fragmented.
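
The doubling is easy to check on a small sample. This is a minimal sketch, with Encoding.ASCII standing in for a single-byte ANSI code page and a made-up helper name (Utf16BytesFor); it ignores the small fixed overhead every String object carries.

```vbnet
Imports System
Imports System.Text

Module StringSizeSketch
    ' Approximate bytes of character data a .NET String needs for text that
    ' occupies ansiByteCount bytes in a single-byte encoding.
    Function Utf16BytesFor(ByVal ansiByteCount As Long) As Long
        Return ansiByteCount * 2 ' one 2-byte Char per single-byte character
    End Function

    Sub Main()
        Dim sample As String = "-95.50"
        ' Assumption: ASCII stands in for the single-byte encoding of the file.
        Dim onDisk As Integer = Encoding.ASCII.GetByteCount(sample)
        Dim inMemory As Integer = Encoding.Unicode.GetByteCount(sample)
        Console.WriteLine("{0} bytes on disk, {1} bytes as UTF-16", onDisk, inMemory)

        Dim fileBytes As Long = 200L * 1024 * 1024
        Console.WriteLine("200 MB file -> about {0} MB of character data", _
                          Utf16BytesFor(fileBytes) \ (1024 * 1024))
    End Sub
End Module
```

For the six-character sample this reports 6 bytes on disk and 12 bytes as UTF-16, and 400 MB for the 200 MB file.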
 
Others have answered on the ANSI/Unicode limits. I'm commenting on a design
alternative, since you mentioned reading from disk and slowness.

I'm relatively new to VB.NET, but if I had this large-text-file requirement
in VB.NET (and I will), I would naturally look for memory-mapped I/O support.

A memory-mapped file is virtualized between disk and memory, so it is far
faster than reading from disk alone. You can't even tell the difference
unless you profile it, and it is not much slower than loading everything
into memory, which will probably create other scalability and performance
issues anyway, such as more page faults and GC stress.

If .NET already has library support for it, you might want to check that
out. I did a quick search for it last week but didn't find anything. Let me
research again..... OH WONDERFUL!

It appears .NET 4.0 adds a new class for this:

System.IO.MemoryMappedFiles.MemoryMappedFile

and best I can see, the only MSDN search result for that is:

http://blogs.msdn.com/salvapatuel/archive/2009/06/08/working-with-memory-mapped-files-in-net-4.aspx

In theory, all allocated memory is virtualized anyway, but not under your
control, for example with a big string.

I wonder if MemoryStream is the .NET version of a memory map. The MSDN docs
for MemoryStream don't quite spell that out, but I suspect the underlying
implementation is memory mapped. :-)

But as you can see from the example in this blog, creating or opening one
is easy.

This is going to be a godsend for VB.NET developers! Large file handling
will be much better in VB.NET now!

I'm still going to see if I can write a CMemoryMapFile class for
.NET 2.0. :-)
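
For reference, a minimal sketch of what reading a text file through a memory-mapped view might look like on .NET 4.0 or later. The module and function names are made up, and the caller is assumed to know the file's encoding. Note that MemoryStream is a wrapper around an in-memory byte array, not a file mapping, so it would not help here.

```vbnet
Imports System
Imports System.IO
Imports System.IO.MemoryMappedFiles
Imports System.Text

Module MemoryMapSketch
    ' Counts the lines of a non-empty text file through a memory-mapped view
    ' instead of loading the whole file into one String.
    Function CountLines(ByVal fileName As String, ByVal enc As Encoding) As Long
        Dim count As Long = 0
        Dim length As Long = New FileInfo(fileName).Length
        Using mmf As MemoryMappedFile = MemoryMappedFile.CreateFromFile(fileName, FileMode.Open)
            ' Pass the exact file length: a default view is rounded up to the
            ' page size and would expose trailing zero bytes after the data.
            Using view As MemoryMappedViewStream = mmf.CreateViewStream(0, length)
                Using reader As New StreamReader(view, enc)
                    While reader.ReadLine() IsNot Nothing
                        count += 1
                    End While
                End Using
            End Using
        End Using
        Return count
    End Function

    Sub Main()
        ' Hypothetical three-line test file.
        Dim fileName As String = Path.GetTempFileName()
        File.WriteAllLines(fileName, New String() {"a", "b", "c"})
        Console.WriteLine(CountLines(fileName, Encoding.UTF8))
        File.Delete(fileName)
    End Sub
End Module
```

In a 32-bit process the view still has to fit into the address space, so for very large files it may be necessary to map one window of the file at a time rather than the whole thing.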
 
Thanks for all the answers. I got some results reading the file with the
following code:

Dim cadena As String = ""
Dim dato As Array
If File.Exists(ruta) = True Then
dato = My.Computer.FileSystem.ReadAllBytes(ruta)
cadena = System.Text.Encoding.GetEncoding(0).GetString(dato)
Return cadena
End If

Now I get the memory error when I try to split the string into an array.
What is the size limit for an array?

Thanks in advance.

Freddy Coal
 
My Engineering Opinion:

Operating on a large block of memory through a String class is
inefficient.

While you might be able to LOAD a huge string, working with it is
another story, because string operations imply all sorts of hidden
costs: unbounded sizes, temporary duplication, and memory held while
the copies exist. With a high frequency of such temporary operations,
it is highly inefficient and performance suffers.

If this were C/C++, you would have more power using pointers, so maybe
you can emulate the same functionality in VB.NET. It really all depends
on what you want to do. Based on my experience with a high-load
product, you just wouldn't work with a huge string, especially one that
is managed and wrapped in OOP, unless you had a set practical limit.
200 MB? You are asking for a lot, IMO. Loading is one thing; working
with it is an entirely different set of issues.

I know there is a tendency to use the tools, like a String class, to
handle any kind of length requirement. I fall into that too, but it is
not practical in many huge-data cases. You need limits and a working
knowledge of how data is manipulated in memory by all these
higher-level wrappers, classes, functions and methods.

The point?

There are solutions for huge data and arrays, but you need to do more
than just use the basic classes provided to you. For example,
virtualize the data using a memory map, work in clusters or blocks of
data, or make more use of pointers. In fact, I think there is a
solution here using a custom stream class.

.NET 4.0 now includes a System.IO.MemoryMappedFiles.MemoryMappedFile
class. That is going to do wonders for high-end VB.NET development. I
really wish MS would provide it for .NET 2.0 environments.
 
Freddy said:
Thanks for all the answers. I got some results reading the file with
the following code:

Dim cadena As String = ""
Dim dato As Array
If File.Exists(ruta) = True Then
dato = My.Computer.FileSystem.ReadAllBytes(ruta)
cadena = System.Text.Encoding.GetEncoding(0).GetString(dato)
Return cadena
End If

Taking the longer way around through My.Crap usually does not help. You
would be better off calling IO.File.ReadAllBytes directly.

Then you are using GetEncoding(0), which returns the default encoding.
System.Text.Encoding.Default would do the same.

But the main problem is that you don't seem to know which encoding is used
in the file. This time it's the default encoding; last time you passed
detectEncodingFromByteOrderMarks = True to the StreamReader. Which one is
correct?

I suspect that it's not even a pure text file that you want to read, is it?

Freddy said:
Now I get the memory error when I try to split the string into an
array. What is the size limit for an array?

System.IO.File.ReadAllLines should work fine if you pass the appropriate
encoding.


Armin
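
A minimal sketch of the ReadAllLines call with an explicit encoding. Encoding.Unicode here is only a stand-in, since the thread never establishes which encoding the real file uses, and the module name, helper name and temporary file are made up.

```vbnet
Imports System
Imports System.IO
Imports System.Text

Module ReadAllLinesSketch
    ' Reads every line of a file, decoding with the encoding the caller names.
    Function LoadLines(ByVal fileName As String, ByVal enc As Encoding) As String()
        Return File.ReadAllLines(fileName, enc)
    End Function

    Sub Main()
        ' Assumption: Unicode stands in for whatever the real file uses.
        Dim enc As Encoding = Encoding.Unicode
        Dim fileName As String = Path.GetTempFileName()
        File.WriteAllLines(fileName, New String() {"first", "second"}, enc)

        Dim lines As String() = LoadLines(fileName, enc)
        Console.WriteLine("{0} lines, first = {1}", lines.Length, lines(0))

        File.Delete(fileName)
    End Sub
End Module
```

This still holds every line in memory at once, so it only addresses the decoding question, not the overall memory footprint.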
 
Mike, thank you very much for your response (you are making me improve my
code :) ). I'm still learning...

I agree that loading everything into memory is inefficient (and requires a
very robust machine), but I don't know a better way to read the file. I
don't know whether reading the file line by line (which is what I do at the
moment) is faster than loading everything into memory and processing it in
arrays. I also don't know how to read the text file in 'blocks', where each
block is the set of lines that share the same time value. Getting those
blocks is very easy with tools like MatchCollection when everything is in a
string, but when everything is in a file, the only solution I know of is to
read the file line by line.

An example of my text file looks something like this:

"1","07/01/2008 16:03:23.304","2:08:25.26N","76:38:38.27W","","8","869800000.00","10400000.00","-95.50"
"2","07/01/2008 16:03:23.304","2:08:25.26N","76:38:38.27W","","8","869800000.00","10400000.00","-89.88"
"3","07/01/2008 16:03:23.304","2:08:25.26N","76:38:38.27W","","8","869800000.00","10400000.00","-79.75"

The most important values for me are the last column of each line and the
date/time. I gather all the values with the same time, and from them I get a
trace. Many of the other values are the same throughout the text file.

Mike, thanks for your time; any advice is welcome.

Freddy Coal
 
Thanks Armin. Yes and no: I use that piece of code to read other files that
are not pure text, but they are small (less than 3 MB), and that code works
great for me.

You are right that the encoding is very important in some cases, but not in
this one.

Thanks for your comments, Armin.

Freddy Coal
 
FC said:
Thanks Armin. Yes and no: I use that piece of code to read other
files that are not pure text, but they are small (less than 3 MB),
and that code works great for me.

You are right that the encoding is very important in some cases, but
not in this one.

I don't understand you, because if I specify the correct encoding, I can
call ReadAllLines without an exception.


Armin
 
Freddy said:
I agree that loading everything into memory is inefficient (and requires a
very robust machine), but I don't know a better way to read the file. I
don't know whether reading the file line by line (which is what I do at the
moment) is faster than loading everything into memory and processing it in
arrays,

There shouldn't really be that much difference in speed, and your main
problem is your memory usage.

If you read the file into a byte array, then decode it, then split it
into lines, you are using about 1 GB of memory to read a 200 MB file:
200 MB for the bytes, roughly 400 MB for the decoded string, and roughly
400 MB more for the lines. You should clearly do some basic processing
of the data while reading the stream, so that you don't have three
copies of all the data in memory at once.

Freddy said:
and I don't know how to read the text file in 'blocks', where each block
is the set of lines that share the same time value;

You can't. Files are not line oriented (or even character oriented), so
you can't do any line-based operations directly on a file.

Use a StreamReader to read the file line by line. The FileStream will
buffer the input, and the StreamReader will handle decoding and
detecting line breaks.

Freddy said:
Getting those blocks is very easy with tools like MatchCollection when
everything is in a string, but when everything is in a file, the only
solution I know of is to read the file line by line.

There isn't really any other reasonable way to do it for such a large file.

Freddy said:
An example of my text file looks something like this:

"1","07/01/2008 16:03:23.304","2:08:25.26N","76:38:38.27W","","8","869800000.00","10400000.00","-95.50"
"2","07/01/2008 16:03:23.304","2:08:25.26N","76:38:38.27W","","8","869800000.00","10400000.00","-89.88"
"3","07/01/2008 16:03:23.304","2:08:25.26N","76:38:38.27W","","8","869800000.00","10400000.00","-79.75"

The most important values for me are the last column of each line and the
date/time. I gather all the values with the same time, and from them I get a
trace. Many of the other values are the same throughout the text file.

You should read the lines and parse each line into an object, which you
can then easily work with. When you parse a string into numerical data,
it will also take up less memory. A string holding one line will take
up about 220 bytes, while an object holding the parsed data would take
up about 70 bytes.

The class for parsing and holding the data could look something like
this (guessing wildly about what the data is actually for...):

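Imports System.Globalization

Public Class TempData

    Private _id As Integer
    Private _time As Date
    Private _latitude As Double
    Private _longitude As Double
    Private _id2 As Integer
    Private _x As Double
    Private _y As Double
    Private _temperature As Double

    ' Parses one line such as "1","07/01/2008 16:03:23.304","2:08:25.26N",...
    Public Sub New(ByVal data As String)
        ' Drop the outer quotes, then split on the "," between fields.
        Dim s As String() = data.Substring(1, data.Length - 2).Split( _
            New String() {""","""}, StringSplitOptions.None)
        ' The invariant culture keeps "." as the decimal separator on any PC.
        Dim inv As CultureInfo = CultureInfo.InvariantCulture
        _id = Integer.Parse(s(0), inv)
        _time = DateTime.Parse(s(1), inv) ' assumes month/day/year order
        _latitude = ParseCoordinate(s(2))
        _longitude = ParseCoordinate(s(3))
        ' s(4) is empty in the sample lines, so it is skipped.
        _id2 = Integer.Parse(s(5), inv)
        _x = Double.Parse(s(6), inv)
        _y = Double.Parse(s(7), inv)
        _temperature = Double.Parse(s(8), inv)
    End Sub

    ' Assumption: coordinates are degrees:minutes:seconds plus a hemisphere
    ' letter, e.g. 2:08:25.26N. South and west become negative.
    Private Shared Function ParseCoordinate(ByVal text As String) As Double
        Dim inv As CultureInfo = CultureInfo.InvariantCulture
        Dim p As String() = text.Substring(0, text.Length - 1).Split(":"c)
        Dim degrees As Double = Double.Parse(p(0), inv) _
            + Double.Parse(p(1), inv) / 60 + Double.Parse(p(2), inv) / 3600
        If text.EndsWith("S") OrElse text.EndsWith("W") Then degrees = -degrees
        Return degrees
    End Function

    Public ReadOnly Property Id() As Integer
        Get
            Return _id
        End Get
    End Property

    Public ReadOnly Property Time() As Date
        Get
            Return _time
        End Get
    End Property

    Public ReadOnly Property Latitude() As Double
        Get
            Return _latitude
        End Get
    End Property

    Public ReadOnly Property Longitude() As Double
        Get
            Return _longitude
        End Get
    End Property

    Public ReadOnly Property Id2() As Integer
        Get
            Return _id2
        End Get
    End Property

    Public ReadOnly Property X() As Double
        Get
            Return _x
        End Get
    End Property

    Public ReadOnly Property Y() As Double
        Get
            Return _y
        End Get
    End Property

    Public ReadOnly Property Temperature() As Double
        Get
            Return _temperature
        End Get
    End Property

End Class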
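
To tie the line-by-line advice to the "blocks with the same time" requirement, here is a minimal, self-contained sketch that streams the records and collects the last column under each date/time value. It assumes one record per line with the date and time separated by a space; the module name, GroupByTime and the single-quote trick used to build the sample are made up for illustration.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Globalization
Imports System.IO

Module TraceSketch
    ' Reads records line by line and collects the last column of each record
    ' under its date/time value, so only one line is held as text at a time.
    Function GroupByTime(ByVal reader As TextReader) As Dictionary(Of String, List(Of Double))
        Dim traces As New Dictionary(Of String, List(Of Double))()
        Dim separator As String() = New String() {""","""}
        Dim line As String = reader.ReadLine()
        While line IsNot Nothing
            If line.Length > 2 Then
                ' Drop the outer quotes, then split on the "," between fields.
                Dim fields As String() = line.Substring(1, line.Length - 2) _
                    .Split(separator, StringSplitOptions.None)
                Dim timeKey As String = fields(1)
                Dim level As Double = Double.Parse(fields(fields.Length - 1), _
                                                   CultureInfo.InvariantCulture)
                Dim values As List(Of Double) = Nothing
                If Not traces.TryGetValue(timeKey, values) Then
                    values = New List(Of Double)()
                    traces.Add(timeKey, values)
                End If
                values.Add(level)
            End If
            line = reader.ReadLine()
        End While
        Return traces
    End Function

    Sub Main()
        ' Sample records from the thread; single quotes are swapped for double
        ' quotes only to keep the literals readable.
        Dim sample As String = String.Join(Environment.NewLine, New String() { _
            "'1','07/01/2008 16:03:23.304','2:08:25.26N','76:38:38.27W','','8','869800000.00','10400000.00','-95.50'", _
            "'2','07/01/2008 16:03:23.304','2:08:25.26N','76:38:38.27W','','8','869800000.00','10400000.00','-89.88'", _
            "'3','07/01/2008 16:03:23.304','2:08:25.26N','76:38:38.27W','','8','869800000.00','10400000.00','-79.75'" _
            }).Replace("'"c, """"c)

        Using reader As New StringReader(sample)
            Dim traces As Dictionary(Of String, List(Of Double)) = GroupByTime(reader)
            For Each pair As KeyValuePair(Of String, List(Of Double)) In traces
                Console.WriteLine("{0}: {1} values", pair.Key, pair.Value.Count)
            Next
        End Using
    End Sub
End Module
```

For the real file, a StreamReader opened with the correct encoding can be passed in place of the StringReader. If the records arrive in time order, each trace could also be processed and discarded as soon as the time value changes, rather than keeping every trace in the dictionary.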
 