change XML encoding

  • Thread starter: Keith G Hicks

Keith G Hicks

Okay, I need to clean up these files. They are coming out of this goofy
system with this header:

<?xml version=?1.0? encoding=?UTF-8??>

The quotes around things are not coming in as quotes. And it's not the
correct encoding anyway. It needs to be this:

<?xml version="1.0" encoding="ISO-8859-1"?>


So I guess I need to change the encoding of each file before I can open it
up as an XML doc and read it there. I have no idea what is the best way to
do this programmatically in vb.net. Do I need to open with StreamWriter or
is there an easier way? I can't find anything out there that explains this
clearly. If I need to do this with streamwriter could someone point me
somewhere that shows how to do this?

Thanks,

Keith
 
Well, if the XML files were well-formed, you'd simply load them into a
W3C-compliant XML DOM document, which Microsoft makes available in the
System.Xml namespace as the XmlDocument class. Now, with LINQ, we also
have the XDocument type, which I believe is much easier to work with and can
be declared with type inference, as in:

Dim someXML = XDocument.Parse("...xml goes here...")

The problem is that right now, you can't use either of these because your
XML isn't well-formed. Your first goal should be to try to get the XML that
you are initially receiving to be well-formed.

As for the encoding, you can read the original XML into a new XML DOM
Document or XDocument and set the encoding of that new document.
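As a rough sketch of that idea, assuming the XML has already been made well-formed: the snippet below loads a small document and saves it through an XmlWriter whose settings specify ISO-8859-1. The sample XML and the output file name reencoded.xml are made up for illustration.

```vbnet
Imports System.Text
Imports System.Xml
Imports System.Xml.Linq

Module ReencodeXml
    Sub Main()
        ' Sample document (hypothetical); ChrW(&HF1) is "n with a tilde".
        Dim doc As XDocument = XDocument.Parse( _
            "<root><name>Espa" & ChrW(&HF1) & "a</name></root>")

        ' The writer's encoding controls both the bytes written and the
        ' encoding name placed in the XML declaration.
        Dim settings As New XmlWriterSettings()
        settings.Encoding = Encoding.GetEncoding("ISO-8859-1")

        ' "reencoded.xml" is a placeholder output path.
        Using writer As XmlWriter = XmlWriter.Create("reencoded.xml", settings)
            doc.Save(writer)
        End Using
    End Sub
End Module
```

This only works once the parser can read the input, so it does not by itself solve the broken first line.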

Where are these XML streams coming from in the first place?
 
They're coming from a crappy Mac system that is very inflexible. They have
almost no control over how these get output. I wish I could get them
well-formed but I'm sort of stuck.
 
The first line of the files I'm getting is fouled up and so I cannot
open/read them at all using any XML features in VB. The first line is not
recognizable. It's coming to me saying it's UTF-8 but it's not, and the
double quotes in the header are not coming to me as double quotes.

When I use StreamReader, alter the first line and then save it as a new
file, that almost works but the characters that need to have the correct
encoding actually get changed to something else in the save process. I'm
guessing the stream reader is interpreting them funny and so it doesn't
really matter what I change the header to, the characters themselves change
(I checked in a hex editor to be sure).

So since it works to manually open these files in notepad and simply change
the header to the correct encoding, the characters themselves MUST have the
correct binary values. All that needs to be done is to change that header to
the right encoding without fouling up the characters in the body.

So how can I open the file in the most raw form of text, replace that first
line and save it without changing the characters in question in the process?

I made some progress with this:

Dim sr As New StreamReader(xmlFilesLocation & "\" & sArticleToPost, Encoding.UTF7)

Dim text As String = sr.ReadToEnd

Dim text2() As String

ReDim text2(1)

text2(0) = text.Replace("<?xml version=1.0 encoding=UTF-8?>", _
    "<?xml version=""1.0"" encoding=""ISO-8859-1""?>")

System.IO.File.WriteAllLines(xmlFilesLocation & "\x" & sArticleToPost, text2)


The text2 variable shows the correct characters and when I copy its value
into notepad it's fine. But it doesn't save right. I still get weirder
characters than I want. It's supposed to have characters like N with a
tilde, O with a tilde, O with an accent mark, etc. There are about 6 or 7 I
expect to see in this file. But when I open the newly saved files, those
characters are converted into very strange characters that I'd have to show
you.


I have a question regarding all of this. The encoding header merely tells
the program that's opening the file how to read the characters that are in
it. The characters are of course ultimately stored in binary, so the encoding
determines how the binary is interpreted as readable characters. If I open a
file using one encoding and the characters look a certain way and then save it
using another, the binary changes. Is this all true? Am I
understanding this or not? I mean, the 0's and 1's that are stored on disk
don't change just because of the way you open the file. If you open it using
one interpreter (encoding) and they look one way, then open it using another
encoding, you'll see different characters. That makes sense to me. So the
only way I could see the binary changing is if the encoding used when saving
turns the characters into a different string of 1's and 0's. Yes?
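A small sketch of that reasoning, using a single "n with a tilde" as a made-up sample rather than one of the real files: decoding the same bytes with two encodings gives different strings without touching the bytes, while re-encoding the text with a different encoding produces different bytes.

```vbnet
Imports System
Imports System.Text

Module EncodingDemo
    Sub Main()
        Dim latin1 As Encoding = Encoding.GetEncoding("ISO-8859-1")
        Dim enye As String = ChrW(&HF1)   ' n with a tilde, U+00F1

        ' In ISO-8859-1 this character is the single byte F1.
        Dim original As Byte() = latin1.GetBytes(enye)
        Console.WriteLine(BitConverter.ToString(original))    ' F1

        ' Decoding only interprets the bytes; the byte array is unchanged.
        Dim asLatin1 As String = latin1.GetString(original)
        Dim asUtf8 As String = Encoding.UTF8.GetString(original)
        Console.WriteLine(asLatin1 = enye)   ' True
        Console.WriteLine(asUtf8 = enye)     ' False: F1 alone is not valid UTF-8

        ' Encoding the correctly decoded text as UTF-8 gives different bytes.
        Dim reencoded As Byte() = Encoding.UTF8.GetBytes(asLatin1)
        Console.WriteLine(BitConverter.ToString(reencoded))   ' C3-B1
    End Sub
End Module
```

So the bytes on disk change only at save time, and only if the encoding used for writing differs from the one the bytes were originally in.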

Okay, so when I choose the "encoding" parameter of StreamReader, there are
only about 5 options (UTF-7, UTF-8, UTF-32, ASCII, Default, ...) How do I
tell it I want it to read AND SAVE as ISO-8859-1????

Opening UTF-7 seems to help but OMG when I save using UTF-7 things are a big
mess.


Thanks,

Keith
 
Keith said:
Okay, so when I choose the "encoding" parameter of StreamReader, there are
only about 5 options (UTF-7, UTF-8, UTF-32, ASCII, Default, ...) How do I
tell it I want it to read AND SAVE as ISO-8859-1????

Encoding.GetEncoding("ISO-8859-1") should give you an Encoding instance
that lets you decode and encode with ISO-8859-1.
Both StreamReader and StreamWriter allow you to specify an encoding;
for instance, for StreamWriter see
http://msdn.microsoft.com/en-us/library/f5f5x7kt.aspx
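For the header problem in this thread, a minimal sketch along those lines might look like the following. ReplaceFirstLine is a hypothetical helper, and the paths come from the command line. It assumes the XML declaration is the entire first line of the file. Because ISO-8859-1 maps every byte value to exactly one character, reading and writing with the same ISO-8859-1 encoding should leave the bytes of the body unchanged.

```vbnet
Imports System.IO
Imports System.Text

Module FixXmlHeader
    ' Hypothetical helper: replaces the first line of the file with a correct
    ' XML declaration and writes the result to targetPath.
    Sub ReplaceFirstLine(ByVal sourcePath As String, ByVal targetPath As String)
        Dim latin1 As Encoding = Encoding.GetEncoding("ISO-8859-1")
        Dim content As String = File.ReadAllText(sourcePath, latin1)

        ' Find the end of the first line (CR, LF, or CRLF) and keep
        ' everything from the line break onward exactly as it was read.
        Dim breaks As Char() = {ControlChars.Cr, ControlChars.Lf}
        Dim lineEnd As Integer = content.IndexOfAny(breaks)
        Dim body As String = If(lineEnd >= 0, content.Substring(lineEnd), "")

        Const header As String = "<?xml version=""1.0"" encoding=""ISO-8859-1""?>"
        File.WriteAllText(targetPath, header & body, latin1)
    End Sub

    Sub Main(ByVal args As String())
        ' Usage (assumed): FixXmlHeader.exe input.xml output.xml
        If args.Length = 2 Then ReplaceFirstLine(args(0), args(1))
    End Sub
End Module
```

Passing the same Encoding object to both the read and the write is the important part; calling File.WriteAllText or File.WriteAllLines without an encoding argument writes UTF-8 instead.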
 
Yep. I found that out late last night. Thanks. It took quite a bit of
hunting around to figure this out. It's not intuitive. "GetEncoding" sounds
like a read only property. The word "get" is misleading. I finally landed
upon something that I read that explained that it was more like "Set" than
"Get". Now I'm sure they meant that GetEncoding("ISO-8859-1") means to "get"
the encoding of "ISO-8859-1" but that's a bit ambiguous. With
Encoding.GetEncoding as the 3rd param of StreamReader, it also could be
interpreted as "get the current encoding of the stream".

Thanks for the info.
 
I believe that it's just your preconception about what the GetEncoding
method does that makes it seem misleading.

The GetEncoding method interprets the parameter that you send to it and
returns an Encoding object that handles that specific encoding.
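To illustrate (a sketch only), GetEncoding behaves like a lookup: given a name or a code page number, it hands back an Encoding object and changes nothing about any stream or file.

```vbnet
Imports System
Imports System.Text

Module GetEncodingDemo
    Sub Main()
        ' Look up the same encoding by name and by code page number.
        Dim byName As Encoding = Encoding.GetEncoding("ISO-8859-1")
        Dim byCodePage As Encoding = Encoding.GetEncoding(28591)

        Console.WriteLine(byName.WebName)       ' iso-8859-1
        Console.WriteLine(byName.CodePage)      ' 28591
        Console.WriteLine(byCodePage.WebName)   ' iso-8859-1
    End Sub
End Module
```

The returned object only takes effect where you pass it, for example to a StreamReader or StreamWriter constructor.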

If you want to set the default encoding for the system, that is not done
using the Encoding class itself. You create the appropriate encoding
object and apply it to a context, depending on how wide a scope you want
it to affect.
 