Problem writing non-English characters (re-post)

  • Thread starter: barry

barry

Thanks for your reply.

I am parsing a French website. I found that it reads the characters
correctly when I open the .html file using

TextReader tr = new StreamReader("XYZ.htm", new UTF7Encoding(true));

I am able to parse the data correctly; I have checked this using the Visual
Studio 2003 debugger. The problem arises after writing the data to a file,
which I open using

TextWriter tw = new StreamWriter("zxy.txt", false, new UTF7Encoding(true));
tw.WriteLine("", new UTF7Encoding(true));
tw.Close();
 
barry said:
Thanks for your reply.

It's generally a good idea to keep all posts within the same thread
rather than starting a new thread each time.

barry said:
I am parsing a French website. I found that it reads the characters
correctly when I open the .html file using

TextReader tr = new StreamReader("XYZ.htm", new UTF7Encoding(true));

That sounds unlikely. UTF-7 is used in very specific circumstances
(mail, IIRC).

How are you validating that it's reading the characters correctly?

barry said:
I am able to parse the data correctly; I have checked this using the Visual
Studio 2003 debugger. The problem arises after writing the data to a file,
which I open using

TextWriter tw = new StreamWriter("zxy.txt", false, new UTF7Encoding(true));
tw.WriteLine("", new UTF7Encoding(true));
tw.Close();

You really don't want to be using UTF-7, either for reading or writing.

Jon
 
Sorry, I did not deliberately post another thread. It happened because I was
selecting "Reply" instead of "Reply Group" and did not have a proper e-mail
address set up, so my message was not going through.

Now, regarding the issue:

How do I read from a website which has those odd characters like
"élémentaire" and then write them to a file?

Do I use UTF8Encoding (the default) or some other encoding for both reading
and writing these types of characters? Could you point me to some code or
give me some idea?
 
You need to use whatever encoding the data is transmitted in to read
it. This is usually indicated in the content-type header.

You need to use whatever encoding you want the output in to write it.
That will be determined by what you want to do with it afterwards, but
UTF-8 is a good starting point.

Jon
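
As an illustration of the advice above, here is a minimal sketch that picks the decoder from a Content-Type value and writes the result as UTF-8. The class and method names, the hard-coded header value, the sample bytes, and the UTF-8 fallback are all assumptions made for the example; a real program would take the header and body from the actual HTTP response.

```csharp
using System;
using System.IO;
using System.Net.Http.Headers;
using System.Text;

static class CharsetSketch
{
    // Returns the encoding named by a Content-Type header value.
    // Falling back to UTF-8 when no charset is given is an assumption of this sketch.
    public static Encoding EncodingFromContentType(string contentType)
    {
        string charset = MediaTypeHeaderValue.Parse(contentType).CharSet;
        if (string.IsNullOrEmpty(charset))
        {
            return Encoding.UTF8;
        }
        // Some servers quote the value, e.g. charset="utf-8".
        return Encoding.GetEncoding(charset.Trim('"'));
    }

    static void Main()
    {
        // Stand-in for a response body: "élémentaire" as ISO-8859-1 bytes.
        byte[] body = { 0xE9, 0x6C, 0xE9, 0x6D, 0x65, 0x6E, 0x74, 0x61, 0x69, 0x72, 0x65 };

        // Read with the encoding the data was transmitted in.
        Encoding source = EncodingFromContentType("text/html; charset=ISO-8859-1");
        string text = source.GetString(body);

        // Write with the encoding wanted for the output, here UTF-8.
        File.WriteAllText("out.txt", text, Encoding.UTF8);
        Console.WriteLine(text == "\u00e9l\u00e9mentaire");
    }
}
```

On .NET Core and later, legacy code pages such as windows-1252 are only available after calling Encoding.RegisterProvider(CodePagesEncodingProvider.Instance); ISO-8859-1 and the UTF encodings work without it.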
 
UTF7Encoding has worked. I was viewing the .csv file in Excel, which was
displaying those characters wrongly, but the .csv file itself was correct.

Thanks anyway.


 
barry said:
UTF7Encoding has worked. I was viewing the .csv file in Excel, which was
displaying those characters wrongly, but the .csv file itself was correct.

If UTF7Encoding is working at the moment, it's almost certainly a
coincidence. No reasonable website will be using UTF7Encoding.

When you say "it was correct in the .csv file" what exactly do you
mean? How did you determine that?

Jon
 
If you paste the text into Notepad.exe it will show correctly, but not in
WordPad or Excel.

The problem is not programming-related but rather a matter of installing the
right code pages (how that is to be done I have no idea); I got this from
another forum.

Anyway, thanks for your not-so-useful posting.

You should avoid being the first to post an answer, even if it's very
roundabout.
 
You wrote that UTF7Encoding working is a coincidence.

The following is some helpful code I found on MSDN:

using System;
using System.Text;

class UTF7EncodingExample {
    public static void Main() {
        // Create a UTF-7 encoding.
        UTF7Encoding utf7 = new UTF7Encoding();

        // A Unicode string with two characters outside a 7-bit code range.
        String unicodeString =
            "This Unicode string contains two characters " +
            "with codes outside a 7-bit code range, " +
            "Pi (\u03a0) and Sigma (\u03a3).";
        Console.WriteLine("Original string:");
        Console.WriteLine(unicodeString);

        // Encode the string.
        Byte[] encodedBytes = utf7.GetBytes(unicodeString);
        Console.WriteLine();
        Console.WriteLine("Encoded bytes:");
        foreach (Byte b in encodedBytes) {
            Console.Write("[{0}]", b);
        }
        Console.WriteLine();

        // Decode bytes back to string.
        // Notice Pi and Sigma characters are still present.
        String decodedString = utf7.GetString(encodedBytes);
        Console.WriteLine();
        Console.WriteLine("Decoded bytes:");
        Console.WriteLine(decodedString);
    }
}
 
barry said:
If you paste the text into Notepad.exe it will show correctly, but not in
WordPad or Excel.

Paste the text from where? A web browser? That will already have done
some decoding.

barry said:
The problem is not programming-related but rather a matter of installing the
right code pages (how that is to be done I have no idea); I got this from
another forum.

So you're just assuming that the other answer is correct?

barry said:
Anyway, thanks for your not-so-useful posting.

You should avoid being the first to post an answer, even if it's very
roundabout.

Again, you're assuming my answer is incorrect...
 
barry said:
You wrote that UTF7Encoding working is a coincidence.

The following is some helpful code I found on MSDN:

<snip>

That code just shows that UTF-7 can roundtrip. It doesn't in *any way*
confirm that the web site you're using is actually returning data in
UTF-7.
 
Jon Skeet said:
So you're just assuming that the other answer is correct?

Oh, and if you haven't got the right code pages installed, how do you
think Notepad is showing the right characters?

Have you even tried looking at the headers returned from the web server
to find out what it's claiming to use?

You can assume I'm ignorant about encodings etc if you like, but it
won't help you get to an answer. You might like to read up on
encodings:

http://yoda.arachsys.com/csharp/unicode.html
http://yoda.arachsys.com/csharp/debuggingunicode.html
 
Jon Skeet said:
Paste the text from where? A web browser? That will already have done
some decoding.

Copy and paste the following line of text into Notepad, Excel and WordPad:

Ecole élémentaire privée Notre-Dame
 
barry said:
Copy and paste the following line of text into Notepad, Excel and WordPad:

Ecole élémentaire privée Notre-Dame

That happens to work fine in all of them, from my particular newsreader
- but clipboard operations are relatively complicated. They have very
little to do with what you're doing, which is reading binary data from
a web site, interpreting it as text, and then writing those characters
back to disk (which is inherently binary).

Cut and paste is *not* a good test of encodings.
 
Now, going back to the earlier postings:

I wrote that it is possible to read the characters from the website and
write them to a .csv file, but when I open the file in Excel or WordPad there
are display problems; there are none in Notepad or EditPlus (the editor I use).

I mentioned copy and paste since sending the file here is not possible.

I have completed the project I was working on, one way or the other.


 
barry said:
Now, going back to the earlier postings:

I wrote that it is possible to read the characters from the website and
write them to a .csv file, but when I open the file in Excel or WordPad there
are display problems; there are none in Notepad or EditPlus (the editor I use).

I mentioned copy and paste since sending the file here is not possible.

I have completed the project I was working on, one way or the other.

Well,

I guess the HTML page you are reading is UTF-8 encoded (judging by the
example you gave in your first post).
As you read and then write with the same, probably wrong, encoding, the
second error cancels the first one. Of course, this works only when none
of the byte sequences in the data are invalid in the encoding you use.
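
The cancelling effect described above can be sketched as follows. This illustration uses ISO-8859-1 and UTF-8 rather than UTF-7, and the class and method names are made up for the example. It shows one case where decoding and re-encoding with the wrong encoding happens to preserve the bytes, and one case where it destroys them.

```csharp
using System;
using System.Text;

static class RoundTripSketch
{
    // Decodes bytes with one encoding, then re-encodes the resulting string with the same one.
    public static byte[] DecodeThenEncode(byte[] input, Encoding encoding)
    {
        return encoding.GetBytes(encoding.GetString(input));
    }

    static void Main()
    {
        Encoding latin1 = Encoding.GetEncoding("iso-8859-1");
        byte[] utf8Bytes = Encoding.UTF8.GetBytes("\u00e9l\u00e9mentaire");

        // Wrong decoder, but every byte value is legal in ISO-8859-1,
        // so the bytes survive even though the in-memory string is wrong.
        byte[] viaLatin1 = DecodeThenEncode(utf8Bytes, latin1);
        Console.WriteLine(BitConverter.ToString(viaLatin1) == BitConverter.ToString(utf8Bytes));

        // Reverse case: a lone 0xE9 is not valid UTF-8, so it is replaced
        // by U+FFFD on decoding and the original byte is lost.
        byte[] latin1Bytes = latin1.GetBytes("\u00e9");
        byte[] viaUtf8 = DecodeThenEncode(latin1Bytes, Encoding.UTF8);
        Console.WriteLine(BitConverter.ToString(viaUtf8));
    }
}
```

In the first case the output file looks fine only because both mistakes match; any processing done on the string in between would be working on the wrong characters.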

Now, why can Notepad read it and not WordPad? I know that WordPad
doesn't recognize UTF-8 encoding when the byte order mark is not
present, whereas Notepad can recognize it without the byte order mark.
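
If the missing byte order mark is indeed the reason, one possible approach is to write the file as UTF-8 with the mark included. The following is a minimal sketch; the class name, the file name, and the sample text are assumptions for the example, and whether a given version of WordPad or Excel then displays the file correctly would still need checking.

```csharp
using System;
using System.IO;
using System.Text;

static class BomSketch
{
    public static void WriteUtf8WithBom(string path, string text)
    {
        // UTF8Encoding(true) makes the writer emit the EF BB BF preamble
        // at the start of a newly created file.
        using (var writer = new StreamWriter(path, false, new UTF8Encoding(true)))
        {
            writer.WriteLine(text);
        }
    }

    static void Main()
    {
        WriteUtf8WithBom("zxy.csv", "Ecole \u00e9l\u00e9mentaire priv\u00e9e Notre-Dame");

        // Show the first three bytes of the file to confirm the mark is there.
        byte[] bytes = File.ReadAllBytes("zxy.csv");
        Console.WriteLine(BitConverter.ToString(bytes, 0, 3));
    }
}
```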
 
barry said:
Now, going back to the earlier postings:

I wrote that it is possible to read the characters from the website and
write them to a .csv file, but when I open the file in Excel or WordPad there
are display problems; there are none in Notepad or EditPlus (the editor I use).

To look at what's really in a file, you need to use a binary editor
really.
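
Where no binary editor is at hand, a few lines of code can serve the same purpose. This is only a rough sketch: the class name is made up, and the default file name is taken from the earlier post as a placeholder.

```csharp
using System;
using System.IO;

static class HexDumpSketch
{
    // Formats bytes as space-separated hex pairs, e.g. "C3 A9".
    public static string ToHex(byte[] data)
    {
        return BitConverter.ToString(data).Replace("-", " ");
    }

    static void Main(string[] args)
    {
        // Path comes from the command line; "zxy.txt" is just a placeholder default.
        string path = args.Length > 0 ? args[0] : "zxy.txt";
        if (!File.Exists(path))
        {
            Console.WriteLine("File not found: " + path);
            return;
        }
        Console.WriteLine(ToHex(File.ReadAllBytes(path)));
    }
}
```

In such a dump, "é" appears as C3 A9 if the file is UTF-8 and as E9 if it is ISO-8859-1 or Windows-1252.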

barry said:
I mentioned copy and paste since sending the file here is not possible.

But copy/pasting removes half of the useful information. A better
solution would have been to put a sample file up on the web (or if
you're fetching from a public web site, just give us the URL).

barry said:
I have completed the project I was working on, one way or the other.

If you're still using UTF-7, chances are it's not going to work
properly for all data though...
 
Hi,

Thanks for your offer to scrutinize the problem.

The URL in question contains over 1 million names and email addresses (more
than 13,000 pages). I am sorry, I cannot part with that information.
 
barry said:
Thanks for your offer to scrutinize the problem.

The URL in question contains over 1 million names and email addresses (more
than 13,000 pages). I am sorry, I cannot part with that information.

Okay (although I hope this isn't being used for spam...)

Are you able to post just the headers? That would probably give us all
the information we need.
 
Here is the header of the .csv file (it has some special characters)

N°UAI,Etat,Ouverture,N°SIRET,Secteur,Fermeture,N°FINESS,Contrat,MAJ,Sigle,Tel,Appellation,Fax,Denomination,Mel,Patronyme,Mention distribution,Adresse,Lieu dit,Code postal,Boite postale,Acheminement,Commune d'implantation,Site,Nature,Niveau,Date Ouverture,Date Fermeture,Date de Mise à Jour,Catégorie juridique,Catégorie financière,Hebergement,Situation comptable,Ministère tutelle,Tutelle secondaire,Etat SIRAD
 
barry said:
Here is the header of the .csv file (it has some special characters)

No, not the CSV header - the headers returned by the web server. They
should specify which encoding to use.
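
For reference, one way to look at those headers from code is sketched below using HttpClient, which is newer than the tools used in this thread. The class and method names are made up, and the URL is a placeholder, since the real site was not disclosed.

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;

static class HeaderSketch
{
    // Returns the charset from the response's Content-Type header,
    // or null when the server does not declare one.
    public static string DeclaredCharset(HttpResponseMessage response)
    {
        if (response.Content == null)
        {
            return null;
        }
        var contentType = response.Content.Headers.ContentType;
        return contentType == null ? null : contentType.CharSet;
    }

    static async Task Main()
    {
        using (var client = new HttpClient())
        // Placeholder URL; only the headers are needed, so the body is not buffered.
        using (HttpResponseMessage response = await client.GetAsync(
            "https://example.com/", HttpCompletionOption.ResponseHeadersRead))
        {
            Console.WriteLine(DeclaredCharset(response) ?? "(no charset declared)");
        }
    }
}
```

If no charset is declared in the headers, the page may still name one in an HTML meta tag, which this sketch does not inspect.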
 