StreamReader / StreamWriter Encoding

Jaroslav Jakes · Jan 24, 2005

Hi,

please help.

It sounds so simple. We receive text files (customer orders) as e-mail
attachments. These text files contain a simple order structure, like:
custno, itemno, qty, text

Since these text files are created on different systems, the "text" field
causes some trouble.

Characters like ä, ö, ü are not always converted correctly.

The source code looks like:

- open the text file (StreamReader, encoding = Encoding.Default, detectEncodingFromByteOrderMarks = true)
- convert to a new structure
- write the new text file (StreamWriter)
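
For readers following along, the three steps above might look roughly like the sketch below. It is only an illustration under assumptions: the class and method names are made up, the "new structure" is a placeholder that simply re-joins the four fields with ';', and the output is written as UTF-8. Note that the detectEncodingFromByteOrderMarks flag only recognizes files that start with a byte order mark; for anything else the reader falls back to the encoding passed in.

```csharp
using System;
using System.IO;
using System.Text;

static class OrderFileConverter
{
    // Reads inPath using sourceEncoding (a byte order mark, if present,
    // takes precedence) and writes outPath as UTF-8 without a BOM.
    public static void ConvertFile(string inPath, string outPath, Encoding sourceEncoding)
    {
        using (var reader = new StreamReader(inPath, sourceEncoding, true))
        using (var writer = new StreamWriter(outPath, false, new UTF8Encoding(false)))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                // Placeholder for "convert to a new structure":
                // split into custno, itemno, qty, text and re-join with ';'.
                string[] fields = line.Split(new[] { ',' }, 4);
                writer.WriteLine(string.Join(";", fields));
            }
        }
    }
}
```

A call could look like OrderFileConverter.ConvertFile("orders_in.txt", "orders_out.txt", Encoding.GetEncoding("iso-8859-1")), where both file names are examples. The hard part, choosing the right source encoding, is not solved by this sketch.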

What would you suggest? How could we "detect" the encoding of the file in
order to convert the text-field correctly?

Thanks and regards - Jari

Marcin Grzębski · Jan 24, 2005

Hi Jaroslav,

I recommend using a byte (character) histogram to determine how often
each character code from 128 to 255 occurs.
If you compare those counts with the "special" codes of German, Polish
(or any other) encodings, you can guess the source encoding.

A more sophisticated (e.g. dictionary-based) algorithm could be used
to reduce errors.
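
A minimal sketch of such a histogram, assuming the whole file has already been loaded into a byte array. The class and method names are invented for illustration; only the counting idea comes from the description above.

```csharp
using System;
using System.Collections.Generic;

static class ByteHistogram
{
    // Returns counts[b] = number of occurrences of byte value b for b in 128..255.
    // Entries 0..127 stay zero: plain ASCII bytes do not help to tell
    // ASCII-compatible encodings apart.
    public static int[] CountHighBytes(byte[] data)
    {
        var counts = new int[256];
        foreach (byte b in data)
        {
            if (b >= 128) counts[b]++;
        }
        return counts;
    }

    // Lists the byte values that occur at least once, most frequent first.
    public static List<int> MostFrequent(int[] counts)
    {
        var values = new List<int>();
        for (int b = 128; b < 256; b++)
        {
            if (counts[b] > 0) values.Add(b);
        }
        values.Sort((x, y) => counts[y].CompareTo(counts[x]));
        return values;
    }
}
```

The most frequent values can then be compared by hand, or in code, with the positions of the accented letters in each candidate code page.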

HTH
Marcin

Jaroslav Jakes · Jan 24, 2005

Hi Marcin,

Do you have a link to samples or a further description? Sorry, I don't
know how to do that...

Thanks and regards - Jari

Marcin Grzębski · Jan 24, 2005

Hmm...
I don't know of any links or samples, but I'm sure your problem
came up in this group some time ago.

I can show you the concept of this algorithm:

int germanEncodingCounter = 0;
int polishEncodingCounter = 0;
// the raw bytes of the text file ("orders.txt" is just an example name)
byte[] bytesOfText = System.IO.File.ReadAllBytes("orders.txt");

// I don't know the German char codes, so I used arbitrary numbers
for (int i = 0; i < bytesOfText.Length; i++) {
    switch (bytesOfText[i]) {
        case 170:
            germanEncodingCounter++;
            break;
        case 163:
            germanEncodingCounter++;
            polishEncodingCounter++; // Ł (ISO-8859-2)
            break;
        case 175:
            polishEncodingCounter++; // Ż (ISO-8859-2)
            break;
    }
}

if (polishEncodingCounter > 0 || germanEncodingCounter > 0) {
    if (germanEncodingCounter > polishEncodingCounter) {
        // it looks like a German encoding
    }
    else if (polishEncodingCounter > germanEncodingCounter) {
        // it looks like a Polish encoding
    }
    else {
        // a tie: cannot decide
    }
}
else {
    // no special characters found, encoding unknown
}
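
For the ä, ö, ü case from the original question, the same counting idea could be tried with real byte values. The sketch below is an assumption-laden illustration, not a tested detector: it assumes the only candidates are UTF-8, Windows-1252 and the DOS code page 850, checks strict UTF-8 validity first, and then counts the bytes that stand for ä ö ü Ä Ö Ü ß in the two single-byte code pages. The class name and the returned labels are invented.

```csharp
using System;
using System.Text;

static class UmlautEncodingGuess
{
    // Byte values of ä ö ü Ä Ö Ü ß in each candidate code page.
    static readonly byte[] Windows1252 = { 0xE4, 0xF6, 0xFC, 0xC4, 0xD6, 0xDC, 0xDF };
    static readonly byte[] Cp850 = { 0x84, 0x94, 0x81, 0x8E, 0x99, 0x9A, 0xE1 };

    // Returns a label only; it is not resolved to an Encoding object here.
    public static string Guess(byte[] data)
    {
        bool hasHighByte = Array.Exists(data, b => b >= 128);
        if (!hasHighByte) return "ascii";          // nothing to decide
        if (IsValidUtf8(data)) return "utf-8";     // strict decoding succeeded

        int win = Count(data, Windows1252);
        int dos = Count(data, Cp850);
        if (win > dos) return "windows-1252";
        if (dos > win) return "ibm850";
        return "unknown";                          // tie or no umlaut bytes at all
    }

    static bool IsValidUtf8(byte[] data)
    {
        try
        {
            // throwOnInvalidBytes = true makes GetString fail on malformed input.
            new UTF8Encoding(false, true).GetString(data);
            return true;
        }
        catch (ArgumentException)
        {
            return false;
        }
    }

    static int Count(byte[] data, byte[] wanted)
    {
        int n = 0;
        foreach (byte b in data)
        {
            if (Array.IndexOf(wanted, b) >= 0) n++;
        }
        return n;
    }
}
```

With short inputs the counts are small, so a result like this is a hint at best; a file with a single accented character can easily be misjudged.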

HTH
Marcin

Jaroslav Jakes · Jan 24, 2005

Hi Marcin,

Thanks! I understand what I need to do...

Regards - Jari

