I recently started using a laser printer. I quickly discovered that in
some cases, international characters (vowels with umlauts, vowels with
accents, etc.) required different binary codes when entering them in an
ordinary text file than when entering them in an RTF file.
If I take a text file containing these symbols entered by an ordinary text
editor and insert them into an RTF file, they end up as garbage.
I'm not sure whether that is just a requirement of the program that I'm
using (Atlantis-Nova) or some other reason.
If somebody would explain why there is a difference in some cases, I would
appreciate it very much.
Hi,
I can sympathize, having struggled to work with Japanese characters
using LaTeX, text files, and making efforts to port files from program
to program... yuck!
First, there appear to be many different versions of RTF, which may make
it a bit tricky to give you a solution right now:
http://en.wikipedia.org/wiki/Rich_Text_Format
Here is the relevant character encoding section from the Wikipedia page,
which I hope helps (it may depend on which RTF version the program you
are reading the file with can interpret):
==================
RTF is an 8-bit format.[26] That would limit it to ASCII,[26] but RTF can encode characters beyond ASCII by escape sequences. The character escapes are of two types: code page escapes and, starting with RTF 1.5, Unicode escapes. In a code page escape, two hexadecimal digits following a backslash and typewriter apostrophe are used for denoting a character taken from a Windows code page. For example, if the code page is set to Windows-1256, the sequence \'c8 will encode the Arabic letter bāʼ (ب).
For a Unicode escape the control word \u is used, followed by a 16-bit signed decimal integer giving the Unicode code point number. For the benefit of programs without Unicode support, this must be followed by the nearest representation of this character in the specified code page. For example, \u1576? would give the Arabic letter beth, specifying that older programs which do not have Unicode support should render it as a question mark instead.
The control word \uc0 can be used to indicate that subsequent Unicode escape sequences within the current group do not specify a substitution character.
Until RTF specification version 1.5 release in 1997, RTF has only handled 7-bit characters directly and 8-bit characters encoded as hexadecimal (using \'xx). RTF control words (since RTF 1.5) generally accept signed 16-bit numbers as arguments. Unicode values greater than 32767 must be expressed as negative numbers.[13] If a Unicode character is outside BMP, it cannot be expressed in RTF.[27] Support for Unicode was made due to text handling changes in Microsoft Word – Microsoft Word 97 is a partially Unicode-enabled application and it handles text using the 16-bit Unicode character encoding scheme.[13] Microsoft Word 2000 and later versions are Unicode-enabled applications that handle text using the 16-bit Unicode character encoding scheme.[3]
RTF files are usually 7-bit ASCII plain text. RTF consists of control words, control symbols, and groups. RTF files can be easily transmitted between PC based operating systems because they are encoded as a text file with 7-bit graphic ASCII characters. Converters that communicate with Microsoft Word for MS Windows or Macintosh should expect data transfer as 8-bit characters and binary data can contain any 8-bit values.[15]
==================
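To make the two escape types above concrete, here is a rough Python sketch
of how text could be converted before pasting it into an RTF file. The
function name rtf_escape and the default code page cp1252 are my own
assumptions, not something taken from Atlantis-Nova or the RTF spec, so
treat it as an illustration rather than a finished converter.

```python
def rtf_escape(text, codepage="cp1252"):
    """Rewrite non-ASCII characters as RTF escapes (illustrative sketch).

    The name and the default code page are assumptions; use whatever
    code page your RTF file actually declares.
    """
    out = []
    for ch in text:
        cp = ord(ch)
        if ch in "\\{}":
            # Backslash and braces are RTF syntax, so escape them.
            out.append("\\" + ch)
        elif cp < 128:
            # Plain 7-bit ASCII passes through unchanged.
            out.append(ch)
        else:
            try:
                encoded = ch.encode(codepage)
            except UnicodeEncodeError:
                encoded = b""
            if len(encoded) == 1:
                # Code page escape: backslash, apostrophe, two hex digits.
                out.append("\\'%02x" % encoded[0])
            elif cp <= 0xFFFF:
                # Unicode escape: signed 16-bit decimal, then "?" as the
                # fallback for readers without Unicode support.
                signed = cp - 65536 if cp > 32767 else cp
                out.append("\\u%d?" % signed)
            else:
                # Characters outside the BMP are not handled in this sketch.
                raise ValueError("character outside the BMP: %r" % ch)
    return "".join(out)


print(rtf_escape("Müller"))                      # code page escape
print(rtf_escape("\u0628", codepage="cp1256"))   # the \'c8 example above
print(rtf_escape("\u0628"))                      # falls back to \u1576?
```

This may also explain the garbage in the original question: the same
letter has different byte values in different code pages. For example,
"ü" is byte 0x81 in the old DOS code page 437 but 0xFC in Windows-1252,
so bytes typed in one and read as the other come out wrong. Whether that
is what happened in your editor is only a guess on my part.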
Spec of RTF, Version 1.5:
http://www.biblioscape.com/rtf15_spec.htm
Here is something about how Unicode characters can be encoded into RTF
(from version 1.6 maybe?):
http://latex2rtf.sourceforge.net/rtfspec_6.html
More on RTF conversion for current version 1.9:
http://www.codeproject.com/KB/recipes/RtfConverter.aspx
Hope that helps,