StreamReader - Multibyte chars - Seek()

  • Thread starter: Derrick

Derrick

I'm writing a home-grown CSV text file search. The file is sorted by "id",
which is in the first "column", with the other info after that in the "row".
I seek halfway through the file, move to a row boundary, read the "id" that
is there, determine whether that id is higher or lower than the one I want,
cut the seek range in half, and so on, until I find the row I am looking for.

This works very well. I encoded the file as Unicode so all chars will
always be 2 bytes, since Seek() works on a byte basis and reads work on a char
basis. Is there any "IsLeadByte()" sort of method in C# so that I can keep
the files in UTF-8? (I think that is the default.) Unicode works well, but
nearly doubles the file size: it is only about 90% larger than with the default
encoding, so I'm guessing there are some 2-byte chars in the default.

Thanks in advance!

Derrick
 
Derrick said:
I'm writing a home-grown CSV text file search. The file is sorted by "id",
which is in the first "column", with the other info after that in the "row".
I seek halfway through the file, move to a row boundary, read the "id" that
is there, determine whether that id is higher or lower than the one I want,
cut the seek range in half, and so on, until I find the row I am looking for.

This works very well. I encoded the file as Unicode so all chars will
always be 2 bytes, since Seek() works on a byte basis and reads work on a char
basis. Is there any "IsLeadByte()" sort of method in C# so that I can keep
the files in UTF-8? (I think that is the default.) Unicode works well, but
nearly doubles the file size: it is only about 90% larger than with the default
encoding, so I'm guessing there are some 2-byte chars in the default.

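It's quite easy to tell the first byte of a UTF-8 character: either its
top bit is unset (a single-byte character), or its next-to-top bit is set
(the lead byte of a multibyte sequence). Continuation bytes always have
the top bit set and the next-to-top bit unset.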

The diagram in http://www.cl.cam.ac.uk/~mgk25/unicode.html
would probably help you to see what I mean.
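
As a rough illustration of the bit test described above, here is a minimal sketch in C#. The names Utf8Bytes and IsLeadByte are made up for this example (they are not framework methods), and the check only classifies a byte; it does not validate that the surrounding bytes form well-formed UTF-8.

```csharp
using System;

public static class Utf8Bytes
{
    // Hypothetical helper, not part of the .NET framework.
    // Returns true when b can start a UTF-8 encoded character:
    //   0xxxxxxx  single-byte (ASCII) character
    //   11xxxxxx  lead byte of a multibyte sequence
    // Continuation bytes always look like 10xxxxxx, so they return false.
    public static bool IsLeadByte(byte b)
    {
        return (b & 0xC0) != 0x80;
    }
}
```

For example, 'é' is encoded in UTF-8 as the two bytes 0xC3 0xA9: the first passes this test and the second does not.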

Unfortunately, that doesn't help you a lot for seeking by character
position: basically, you *can't* seek to a particular character (say, the
Nth one) in a UTF-8 stream without examining virtually all the bytes in
between.
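
That said, the search described in the question only needs to land on a row boundary, not on a given character index, and that may be workable with byte offsets alone. In UTF-8 every byte of a multibyte sequence has its top bit set, so a 0x0A byte can only be a real line feed. The sketch below relies on that. RowSeek and ReadRowAt are hypothetical names, and it assumes rows end in '\n' (optionally preceded by '\r'), that no field contains an embedded newline, and that the file has no byte order mark.

```csharp
using System;
using System.IO;
using System.Text;

public static class RowSeek
{
    // Hypothetical helper: returns the text of the first row that starts
    // at or after the given byte offset, or null if no row starts there.
    public static string ReadRowAt(Stream stream, long offset)
    {
        if (offset > 0)
        {
            // Start one byte early so that an offset sitting exactly on the
            // first byte of a row still returns that row.
            stream.Seek(offset - 1, SeekOrigin.Begin);
            int skipped;
            while ((skipped = stream.ReadByte()) != -1 && skipped != '\n') { }
            if (skipped == -1) return null; // ran off the end of the file
        }
        else
        {
            stream.Seek(0, SeekOrigin.Begin);
        }

        // Collect the raw bytes of the row, then decode them in one go so
        // multibyte characters are never split.
        var row = new MemoryStream();
        int b;
        while ((b = stream.ReadByte()) != -1 && b != '\n')
        {
            row.WriteByte((byte)b);
        }
        if (b == -1 && row.Length == 0) return null; // nothing after the last newline

        return Encoding.UTF8.GetString(row.ToArray()).TrimEnd('\r');
    }
}
```

A binary search could call this with the midpoint of the current byte range, compare the "id" at the start of the returned row, and halve the range as in the original post.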
 