Cyrillic characters in VS2005

  • Thread starter: Laurent Bugnion [MVP]

Laurent Bugnion [MVP]

Hi,

Not totally on topic for this group, but...

A colleague of mine wants to have the HTML editor in VS2005 display
cyrillic characters. While setting the encoding using a META tag works
fine (and the encoding is also displayed accordingly in the document
properties in VS2005), the display itself still doesn't show correct
characters.

We reviewed the editor's many options together, but were unable to find
anything about encoding. Is that even possible? If so, how?

Thanks and greetings,
Laurent
 
A colleague of mine wants to have the HTML editor in VS2005 display
cyrillic characters. While setting the encoding using a META tag works
fine (and the encoding is also displayed accordingly in the document
properties in VS2005), the display itself still doesn't show correct
characters.

We reviewed the editor's many options together, but were unable to find
anything about encoding. Is that even possible? If so, how?
There are only two options to support Cyrillic in the Visual Studio editor:
- UTF-8: "File" -> "Advanced Save Options..." then select "Unicode (UTF-8
with signature) - Codepage 65001", and you will have to match the META tag
- Cyrillic - Windows (1251): set the default system locale to Russian
and reboot (http://www.mihai-nita.net/20050611a.shtml)

Problems:

1. Although the "Advanced Save Options..." dialog allows for many other
encodings, when the file is opened the next time it is interpreted as
being in the ANSI code page (determined by the default system locale).
So you can work on a US system and save as "Cyrillic (KOI8-R) - Codepage 20866",
but when you open the file you will have to use "File" -> "Open", select the
file, click the down-arrow on the "Open" button, click "Open With...", then
select "Source Code (Text) Editor With Encoding" and in that list select
"Cyrillic (KOI8-R) - Codepage 20866" again. Quite a pain!
From what I know there is no way to associate a certain encoding with a certain
file, so that you don't have to make this encoding selection every single
time.
A possible work-around is to set the ANSI code page to what you want
(by changing the default system locale), but this means that some encodings
cannot be used (for instance Cyrillic 1251 can be the ANSI code page, but
KOI8-R cannot).

2. UTF-8 with BOM can be recognized by VS, but in my opinion (based on the
W3C specs) this is not standard. Many browsers will deal with it,
but this does not make it right.
UTF-8 without BOM might be recognized by VS by "guessing", which means it
can fail for some files.
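A quick way to check whether a file carries the UTF-8 signature discussed here is to look at its first three bytes for EF BB BF. A minimal sketch (the function name and path handling are my own, not from any tool mentioned in this thread):

```python
def has_utf8_bom(path):
    """Return True if the file starts with the UTF-8 signature EF BB BF."""
    with open(path, "rb") as f:
        # Read only the first three bytes; that is all the BOM occupies.
        return f.read(3) == b"\xef\xbb\xbf"
```

Anything without that exact prefix forces the editor back to guessing, which is why BOM-less UTF-8 detection can fail for some files.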


Long story short:
- VS is not a good editor for stuff in code pages other than ANSI
- VS is not a good editor for HTML
- UTF-8 (no BOM) is a good encoding for HTML pages, no matter the editor

Ok, this is my opinion, I am open for flaming :-)
 
Mihai,

Thanks a lot for your very comprehensive post. Do you have another
editor to recommend for HTML with cyrillic characters (or generally non
ANSI characters)?

Laurent
 
Thanks a lot for your very comprehensive post. Do you have another
editor to recommend for HTML with cyrillic characters (or generally non
ANSI characters)?

Unfortunately, not really :-(
I don't have one preferred editor; I move between Homesite, Dreamweaver,
Notepad, Word, and a Notepad clone that I have written (and which supports
whatever encoding I want).

Homesite 5.5 = sucks for all but Latin 1
Dreamweaver 6 (MX) = does kind of OK for popular encodings (Latin 1,
Japanese, Chinese, Russian), but bad for others (Hindi, Arabic, and for some
reason Korean).
Notepad = ok for UTF-8, since I have a Perl script automatically removing the
BOM before uploading.
Word = ok for everything. I am using it as a "smarter Notepad", since in the
end I run a macro to convert Word styles to html tags and save as Encoded
text (I don't like the HTML produced by Word)

Now, Homesite 5.5 and Dreamweaver 6 are old, so I cannot tell you how the
newer versions behave.

I cannot talk about the new MS HTML editors (the Expression family), because I
did not use them (and FrontPage was a long, long time ago, and I hated it :-)
I am not a WYSIWYG type of guy, and I don't write very fancy stuff :-)


Also, I gave up on legacy encodings a while ago (I use Unicode only :-), so
this saves some of the grief.


In the end, my 2 cents:

First option:
a. Go with whatever editor you like for the features/price, with the
only condition that it should support UTF-8 (VS is also OK for this)
and that you can type using the script you need
b. Write a small script to remove the BOM if the editor wants one, and
also to do code page conversion, if for some reason UTF-8 is not acceptable
(although I don't see any reason to do that)
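Step (b) above can be sketched in a few lines. This is a hypothetical helper, not the Perl script mentioned earlier in the thread; the file names and target encoding are placeholders:

```python
UTF8_BOM = b"\xef\xbb\xbf"

def strip_bom(src, dst, target_encoding=None):
    """Copy src to dst, dropping a leading UTF-8 BOM if present.

    If target_encoding is given, also convert from UTF-8 to that
    code page (only needed if UTF-8 is not acceptable downstream).
    """
    with open(src, "rb") as f:
        data = f.read()
    if data.startswith(UTF8_BOM):
        data = data[len(UTF8_BOM):]
    if target_encoding:
        # Round-trip through text to re-encode into the legacy code page.
        data = data.decode("utf-8").encode(target_encoding)
    with open(dst, "wb") as f:
        f.write(data)
```

Running such a script just before uploading lets you keep using an editor that insists on writing a BOM.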

Second option:
If the preferred editor does not support UTF-8, then try setting the default
system locale to the preferred language, or use AppLocale
(http://www.microsoft.com/globaldev/tools/apploc.mspx)
This is an option only for code pages that can be ANSI code pages
(so, for Russian, Windows 1251 is ok, but KOI8-R or MacCyrillic are not)
 
Hi,
Unfortunately, not really :-(
I don't have one preferred editor; I move between Homesite, Dreamweaver,
Notepad, Word, and a Notepad clone that I have written (and which supports
whatever encoding I want).

<snip>

Great, thanks,
Laurent
 
The problem is either in document encoding or in system settings.
That is not the problem of the editor, I think.


Laurent Bugnion said:
Hi,
Unfortunately, not really :-(
I don't have one preferred editor; I move between Homesite,
Dreamweaver, Notepad, Word, and a Notepad clone that I have written (and
which supports whatever encoding I want).

<snip>

Great, thanks,
Laurent
--
Laurent Bugnion [MVP ASP.NET]
Software engineering, Blog: http://www.galasoft-LB.ch
PhotoAlbum: http://www.galasoft-LB.ch/pictures
Support children in Calcutta: http://www.calcutta-espoir.ch
 
The problem is either in document encoding or in system settings.
That is not the problem of the editor, I think.

I am not sure what you mean.
It depends what one expects from an editor.

Encodings used to be a problem some 7 years ago, and the limitations were
from the system (although NT was Unicode, Win 2000 was the first version that
was really useful for multilingual work).

At this time (2007 :-) I do expect an editor to support any encoding
and script I want, if it is running on a post-Win 2000 OS and the support
is installed.
If it does not, then it is the editor's problem, in my book.
 
Thus wrote Mihai N.,
There are only two options to support Cyrillic in the Visual Studio
editor:
- UTF-8: "File" -> "Advanced Save Options..." then select "Unicode (UTF-8
with signature) - Codepage 65001", and you will have to match the META tag
- Cyrillic - Windows (1251): set the default system locale to Russian
and reboot (http://www.mihai-nita.net/20050611a.shtml)
Problems:

1. Although the "Advanced Save Options..." dialog allows for many other
encodings, when the file is opened the next time it is interpreted as
being in the ANSI code page (determined by the default system locale).

Which isn't surprising unless VS would store the last applied character
encoding for each file -- which wouldn't hurt ;-)

[...]
2. UTF-8 with BOM can be recognized by VS, but in my opinion (based on
the W3C specs) this is not standard. Many browsers will deal with it,
but this does not make it right.
UTF-8 without BOM might be recognized by VS by "guessing", which means it
can fail for some files.

There's nothing non-standard about UTF-8 with a BOM. The only thing you might
want to avoid is saving a pure HTML file with a BOM, because some (older)
browsers happily include the BOM in the rendered text :-/

But for a compilation unit such as .aspx or .cs, there should be no problem
using a BOM. In this case, the build tool (page translator, compiler, etc.)
deals with it. The buildtime encoding doesn't need to match the runtime encoding
anyway[*], and using UTF-8 as the runtime encoding shouldn't produce a BOM in
the output stream, unless you really ask for it.

That also means that UTF-16 or UTF-32 work as universal buildtime encodings
as well, because these encodings always include a BOM.

[*] That's the technical point of view. I would never ever use a buildtime
encoding that can represent more characters than my runtime encoding, because
this is an excellent way to introduce broken content...

Cheers,
 
Which isn't surprising unless VS would store the last applied character
encoding for each file -- which wouldn't hurt ;-)
The project file could be a great place to store that kind of info.
But the fact still remains: VS is not an HTML creation tool.
I like it, it is very strong, and can do a decent job for a lot of file formats,
but I would not push it too much :-) And in fact, MS does not do it either;
this is why they have dedicated HTML tools :-)

There's nothing non-standard about UTF-8 with a BOM. The only thing you
might
want to avoid is saving a pure HTML file with a BOM, because some (older)
browsers happily include the BOM in the rendered text :-/
When in doubt, I go to the standard :-)
I have no opinion on UTF-8 + BOM in general, but I do have opinions on
UTF-8 + BOM in the context of various file formats.
For some formats it is good; for some it is not only bad, but non-standard.
One of the good documents is this:
http://unicode.org/unicode/faq/utf_bom.html#BOM
<<Where the precise type of the data stream is known, the BOM should not be
used.>>
In general, Unicode consistently tries to leave decisions to higher-level
protocols.
There are clear standard methods to identify the encoding of an HTML page,
both as stand-alone file, and as served over HTTP. There is no need for
another one.
And both the HTML and XML (implying XHTML) standards have clear ways to
determine the encoding.
The fact that some browsers handle it properly does not mean it is standard.

Quick, off the top of your head, who is the winner here:
- the http header from the server (determined by the server's config) says
Content-Type: text/html; charset=ISO-8859-1
- the html file has a Content-Type in the head section
<META http-equiv="Content-Type" content="text/html; charset=EUC-JP">
- the html file has a UTF-8 BOM at the very beginning (EF BB BF)
- the content itself is in fact Shift-JIS

If you got the winner right, then why (according to what standard)? :-)

But for a compilation unit such as .aspx or .cs, there should be no problem
using a BOM. In this case, the build tool (page translator, compiler, etc.)
deals with it.
The buildtime encoding doesn't need to match the runtime encoding
anyway[*], and using UTF-8 as the runtime encoding shouldn't produce a BOM in
the output stream, unless you really ask for it.
Agree. Although these days mixing encodings is just a way to ask for trouble,
or to show off (look how cool I am, I can master such a mess :-)
There is no reason to be anything other than Unicode. Ten years ago, yes.
Now there are still some exceptions, but fewer and fewer.
 
Thus wrote Mihai N.,
The project file can be a great place to store that kind of info.
But the fact still remains: VS is not an HTML creation tool.

Sure. Go Expression Web :-)
I like it, it is very strong, and can do a decent job for a lot of file
formats, but I would not push it too much :-) And in fact, MS does not do it
either; this is why they have dedicated HTML tools :-)

I really don't think character encoding is an HTML phenomenon, so having
such a feature in VS would be useful for everybody ;-)
When in doubt, I go to the standard :-)
I have no opinion on UTF-8 + BOM in general, but I do have opinions on
UTF-8 + BOM in the context of various file formats.
For some formats is good, for some is not only bad, but non-standard.
One of the good documents is this:
http://unicode.org/unicode/faq/utf_bom.html#BOM
<<Where the precise type of the data stream is known, the BOM should not be
used.>>

In our case (VS), the precise type of data stream is often unknown.
In general, Unicode consistently tries to leave decisions to higher-level
protocols.
There are clear standard methods to identify the encoding of an HTML page,
both as a stand-alone file, and as served over HTTP. There is no need for
another one.
And both the HTML and XML (implying XHTML) standards have clear ways to
determine the encoding.
The fact that some browsers handle it properly does not mean it is standard.

I don't know how this relates to our topic... I was not talking about standards?
VS usually doesn't load source files via HTTP, nor is every source file XML
or META tagged. These standards aren't applicable as a whole to a design
time environment.

Cheers,
 
I really don't think character encoding is an HTML phenomenon, so having
such a feature in VS would be useful for everybody ;-)
Nothing against :-)
In our case (VS), the precise type of data stream is often unknown.
This is VS's fault (because it is unaware of the HTML ways of specifying
encoding).
VS usually doesn't load source files via HTTP, nor is every source file XML
or META tagged. These standards aren't applicable as a whole to a design
time environment.
Well, some of the standards are applicable.
And, happily enough VS respects them. It is a nice surprise!
VS respects the meta in the head section!
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

Try this:
1. Save the first page from www.yahoo.co.jp. It is encoded using euc-jp.
2. Try opening it in Notepad; you will see junk.
3. Open it in VS; you see Japanese (if you have Japanese support installed).
Change the meta from <meta http-equiv="Content-Type" content="text/html;
charset=euc-jp"> to <meta http-equiv="Content-Type" content="text/html;
charset=utf-8"> and save.
4. Open it in Notepad again; you will see Japanese. The encoding is utf-8.

So VS will save the HTML according to the meta.
Standard compliant, no need to add a BOM, store the encoding somewhere,
or ask every single time. Nice and correct!

So, a better answer for the original question!
I have tried setting the proper encoding in the meta, and Russian works fine.
Tested it with KOI8-R, windows-1251, and utf-8.
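The META-honoring behavior described above can be approximated in a few lines: scan the first chunk of raw bytes for the charset attribute (the tag itself is plain ASCII, so a byte-level regex is safe), then decode with whatever encoding it names. A rough sketch, not VS's actual implementation; the fallback encoding is an arbitrary assumption:

```python
import re

# Byte-level pattern: the META tag is ASCII, so we can search the raw bytes.
CHARSET_RE = re.compile(rb'charset=["\']?([A-Za-z0-9_-]+)', re.IGNORECASE)

def decode_html(raw_bytes, fallback="cp1251"):
    """Decode HTML bytes using the charset named in the META tag, if any."""
    m = CHARSET_RE.search(raw_bytes[:1024])  # META must appear near the top
    encoding = m.group(1).decode("ascii") if m else fallback
    return raw_bytes.decode(encoding)
```

Saving is the mirror image: re-encode the text with whatever charset the META tag currently names, which is exactly what the yahoo.co.jp experiment demonstrates.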
 
Thus wrote Mihai N.,
This is VS's fault (because it is unaware of the HTML ways of
specifying encoding).

My point was that there are source files to which HTML specific rules don't
apply -- such as C#, VB, JavaScript.
Well, some of the standards are applicable.
And, happily enough VS respects them. It is a nice surprise!
VS respects the meta in the head section!
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

I know, although in some quick tests VS didn't keep file encoding and META
tag synchronized all the time. I don't create a lot of plain HTML content,
so I don't have any insight how reliable that feature is.
Try this:
1. Save the first page from www.yahoo.co.jp. It is encoded using euc-jp.
2. Try opening it in Notepad; you will see junk.
3. Open it in VS; you see Japanese (if you have Japanese support installed).
Change the meta from <meta http-equiv="Content-Type" content="text/html;
charset=euc-jp"> to <meta http-equiv="Content-Type" content="text/html;
charset=utf-8"> and save.
4. Open it in Notepad again; you will see Japanese. The encoding is utf-8.

So VS will save the HTML according to the meta.
Standard compliant, no need to add a BOM, store the encoding somewhere,
or ask every single time. Nice and correct!

Mihai, fire up your favorite hex editor and check the first three bytes of
the file: EF BB BF here. ;-)

Cheers,
 
My point was that there are source files to which HTML specific rules don't
apply -- such as C#, VB, JavaScript.
Ah!
I have no problem with this.
As stated somewhere "I do have opinions on UTF-8 + BOM in the context of
various file formats."
So my opinion on the BOM depends on the file format. This is *no* for
html/xml, but might be yes for C#, VB, JavaScript.
And, in fact, C# & VB are MS formats, so if they decide to use 3 BOMs
at the end to identify the encoding, I have no problem with it
(but I will say "WTF?" :-)

Ok, joking aside, I think BOM in C# and VB files is a good thing.
I need to think a bit more about JavaScript.

Mihai, fire up your favorite hex editor and check the first three bytes of
the file: EF BB BF here. ;-)
Damn! Me no like it :-)
 