StreamReader / StreamWriter Encoding

  • Thread starter: Jaroslav Jakes

Jaroslav Jakes

Hi,

please help.

It sounds so simple. We receive text files (customer orders) as e-mail
attachments. These text files contain a simple order structure, like:
custno, itemno, qty, text

Since these text files are created on different systems, the "text" field
causes some trouble.

Characters like ä, ö, ü are not always converted correctly.

The source code looks like:

- open the text file (StreamReader, encoding = Encoding.Default, detectEncodingFromByteOrderMarks = true)
- convert to a new structure
- write the new text file (StreamWriter)
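
The three steps above could be sketched in C# roughly as follows. This is only a sketch under assumptions: the class name, the `ConvertLine` helper, and the choice of UTF-8 for the output are hypothetical and not taken from the original code. Note that the detection flag of `StreamReader` only looks for a byte order mark; a file without one is decoded with the fallback encoding that was passed in, which may be why umlauts break for files produced on other systems.

```csharp
using System.IO;
using System.Text;

static class OrderConverter
{
    // Hypothetical stand-in for "convert to a new structure":
    // it only swaps the ", " separator for ";".
    public static string ConvertLine(string line)
    {
        return line.Replace(", ", ";");
    }

    public static void ConvertFile(string inputPath, string outputPath)
    {
        // true = look for a byte order mark; if none is found,
        // Encoding.Default is used to decode the file.
        using (StreamReader reader = new StreamReader(inputPath, Encoding.Default, true))
        // Assumption: the output is written as UTF-8; false = overwrite the file.
        using (StreamWriter writer = new StreamWriter(outputPath, false, Encoding.UTF8))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                writer.WriteLine(ConvertLine(line));
            }
        }
    }
}
```

A call such as `OrderConverter.ConvertFile("orders_in.txt", "orders_out.txt")` (placeholder file names) would run the whole pipeline once.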

What would you suggest? How could we "detect" the encoding of the file in
order to convert the text field correctly?

Thanks and regards - Jari
 
Hi Jaroslav,

I recommend building a byte (character) histogram to determine how often
each character code from 128 to 255 occurs.
If you compare those counts against the "special" codes of German, Polish
(or any other) encodings, you can guess the source encoding.

A more sophisticated (e.g. dictionary-based) algorithm can be used
to reduce errors.
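
A minimal sketch of such a histogram, assuming the candidates are Windows-1252 and DOS code page 850 (a guess at what "different systems" might mean here, not something stated in the question). The class and member names are hypothetical; the byte values are the German umlauts and ß in those two code pages.

```csharp
static class ByteHistogram
{
    // ä ö ü Ä Ö Ü ß in Windows-1252
    public static readonly byte[] Windows1252German =
        { 0xE4, 0xF6, 0xFC, 0xC4, 0xD6, 0xDC, 0xDF };

    // ä ö ü Ä Ö Ü ß in DOS code page 850
    public static readonly byte[] Cp850German =
        { 0x84, 0x94, 0x81, 0x8E, 0x99, 0x9A, 0xE1 };

    // Counts how often each byte value from 128 to 255 occurs.
    // Index 0 of the result corresponds to byte value 128.
    public static int[] CountHighBytes(byte[] data)
    {
        int[] counts = new int[128];
        foreach (byte b in data)
        {
            if (b >= 128)
            {
                counts[b - 128]++;
            }
        }
        return counts;
    }

    // Sums the counts of the "special" codes of one candidate encoding.
    // All codes must be >= 128.
    public static int Score(int[] counts, byte[] specialCodes)
    {
        int score = 0;
        foreach (byte code in specialCodes)
        {
            score += counts[code - 128];
        }
        return score;
    }
}
```

The candidate with the higher score is the better guess; equal scores (including zero for both) mean the file gives no evidence either way.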

HTH
Marcin
 
Hi Marcin,

do you have a link to samples or a further description? Sorry, I don't know
how to do that...

Thanks and regards - Jari
 
Hmmm...
I don't know of any links or samples, but I'm sure your problem
came up in this group some time ago.

I can show you the concept of this algorithm:

int germanEncodingCounter = 0;
int polishEncodingCounter = 0;
// The raw bytes of the text file; "path" is a placeholder for its location.
byte[] bytesOfText = System.IO.File.ReadAllBytes(path);

// I don't know the German character codes, so the German values are arbitrary placeholders.
for (int i = 0; i < bytesOfText.Length; i++) {
    switch (bytesOfText[i]) {
        case 170:
            germanEncodingCounter++;
            break;
        case 163:
            germanEncodingCounter++;
            polishEncodingCounter++; // Ł in Windows-1250 (displayed as £ in Latin-1)
            break;
        case 175:
            polishEncodingCounter++; // Ż in Windows-1250 (displayed as ¯ in Latin-1)
            break;
    }
}

if (polishEncodingCounter > 0 || germanEncodingCounter > 0) {
    if (germanEncodingCounter > polishEncodingCounter) {
        // it looks like a German encoding
    }
    else if (polishEncodingCounter > germanEncodingCounter) {
        // it looks like a Polish encoding
    }
    else {
        // a tie: the counts do not favour either encoding
    }
}
else {
    // no special characters found, so the encoding cannot be guessed
}
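
One check that could come before the counting above is ruling out UTF-8, since single-byte code pages and UTF-8 use the range 128 to 255 very differently. This is a different technique from the counters, shown only as a sketch; the names `Utf8Check` and `IsValidUtf8` are hypothetical. Pure ASCII input also passes this check, because ASCII is a subset of UTF-8.

```csharp
using System.Text;

static class Utf8Check
{
    // Returns true if the bytes form a valid UTF-8 sequence.
    public static bool IsValidUtf8(byte[] data)
    {
        // Second argument true = throw on invalid byte sequences
        // instead of silently substituting replacement characters.
        UTF8Encoding strictUtf8 = new UTF8Encoding(false, true);
        try
        {
            strictUtf8.GetString(data);
            return true;
        }
        catch (DecoderFallbackException)
        {
            return false;
        }
    }
}
```

If the check fails and the bytes contain values above 127, the file is most likely in some single-byte code page, and the counting approach can be used to choose among the candidates.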

HTH
Marcin
 
Hi Marcin,

thanks! I understand what I need to do now.

Regards - Jari

[Quoting Marcin Grzębski's post above]