Joergen Bech @ post1.tele.dk> said:
Your best bet is probably automating Word to do the conversion
for you if we are talking about a small number of large conversions
with a great deal of compatibility of features. For a web service or
similar which requires a great number of small conversions to take
place without noticeable delay, this might not be the best solution.
I'm not so sure everyone has the Word component, and Word 2.0 doesn't so much
choke on cascading style sheets as get the runs! (Lots of unformatted plain
text interspersed with randomly recognised formatted features.)
I had to do HTML<->RTF conversion for a project (both
directions) and initially bought a $299 component to do it - one
which was supposed to support all the styles required.
Turned out it was a piece of *bleep*: It had problems with
overlapping styles, e.g. <B><I></B></I>. This is fairly typical
of such components out there. Really a bad piece of code
wrapped up in a component and sold on the web to unsuspecting
developers pressed for time.
This is precisely why I need this level of control. <B></B> and <I></I> are
deprecated and not particularly accessible even without the overlap, which is
akin to a single file trying to be shared by multiple parent directories. The
correct markup for such overlapping formatting is
<STRONG><EM></EM></STRONG><EM></EM>, so now the markup is well formed and a
braille reader can render the formatting as well. Try to get a commercially
made algorithm to do this reliably; it's not going to happen unless you or I
do it ourselves. Converting back, however, is key when vital text management
algorithms are missing from the programming language.
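To make the overlap repair concrete, here is a minimal sketch in Python rather
than .NET. It is not taken from any component mentioned in this thread. It
assumes attribute-free inline tags only, and the names fix_overlaps and RENAME
are made up for illustration. When a close tag arrives out of order, it closes
the tags opened after it, closes the target, then reopens the interrupted tags.

```python
import re

# Assumed mapping from presentational to semantic tags (illustrative only).
RENAME = {"b": "strong", "i": "em"}

# Matches a bare open/close tag (no attributes) or a run of text.
TOKEN = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)\s*>|([^<]+)")

def fix_overlaps(markup):
    out = []
    stack = []  # currently open tag names, innermost last
    for m in TOKEN.finditer(markup):
        closing, name, text = m.group(1), m.group(2), m.group(3)
        if text is not None:
            out.append(text)
            continue
        name = RENAME.get(name.lower(), name.lower())
        if not closing:
            stack.append(name)
            out.append(f"<{name}>")
        elif name in stack:
            # Close everything opened after `name`, innermost first.
            reopen = []
            while stack[-1] != name:
                inner = stack.pop()
                out.append(f"</{inner}>")
                reopen.append(inner)
            stack.pop()
            out.append(f"</{name}>")
            # Reopen the interrupted tags in their original order.
            for inner in reversed(reopen):
                stack.append(inner)
                out.append(f"<{inner}>")
        # A close tag with no matching open tag is silently dropped.
    # Close anything still open at the end of the input.
    while stack:
        out.append(f"</{stack.pop()}>")
    return "".join(out)

print(fix_overlaps("<B><I>bold-italic</B>italic</I>"))
```

Fed the overlapping example quoted above, <B><I></B></I>, it yields the
well-formed <strong><em></em></strong><em></em>. Real markup with attributes
would need a proper parser instead of the regular expression.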
I had to spend three weeks rolling my own converter with some
help from subcomponents out there.
Or hindrance? Three weeks sounds pretty good to me. There is a French method
for conversion but I'm not so sure it can handle compound markup (e.g. CSS
combined with HTML) even when running the markup through the server emulator
algorithm. Anyway, I've got just the pages to test with, & if it passes I'll
pass it on.
If not, I guess I'll have to wade through the RTF spec and write my own
converter for the elements I'll be allowing. By restricting what elements
can be used in markup, one can simplify the process of ensuring
security-compliance, standards-compliance, accessibility, and XHTML
conversion.
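A rough sketch of the element-restriction idea, in Python for brevity. The
whitelist ALLOWED and the names WhitelistFilter and filter_markup are
hypothetical; the real set of permitted elements is a design decision. Tags
outside the whitelist are dropped, all attributes are stripped, and text is
re-escaped on the way out.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical whitelist; adjust to whatever elements are actually permitted.
ALLOWED = {"p", "strong", "em", "ul", "li"}

class WhitelistFilter(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased; attributes are deliberately discarded.
        if tag in ALLOWED:
            self.parts.append(f"<{tag}>")

    def handle_endtag(self, tag):
        if tag in ALLOWED:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        # Re-escape text so stray < > & cannot reintroduce markup.
        self.parts.append(escape(data, quote=False))

def filter_markup(markup):
    f = WhitelistFilter()
    f.feed(markup)
    f.close()
    return "".join(f.parts)

print(filter_markup('<p class="x">Hi <font color="red"><em>there</em></font></p>'))
```

Note that this keeps the text content of dropped elements, which is fine for
<font> but not for <script>; a production filter would need to drop the
contents of some elements entirely.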
For HTML->RTF I used TidyATL to convert HTML->XHTML.
Then I augmented a piece of XHTML->RTF code I found somewhere
on the internet. Finally, I fed the RTF into an instance of the
RichTextBox object and read it back again in order to clean up
some superfluous parentheses.
For RTF->HTML, I converted a (rather limited) RTF parser written
in C to .Net.
At least I think that was how I did everything. Been a while.
I only had to support font names and sizes, forecolor, backcolor,
bold, italic, strikethrough, indentation, and bulleted lists.
Not images or tables or other fancy things.
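For what it's worth, the XHTML->RTF step for a small feature set like this can
be sketched as a recursive tree walk. This is not the code described above,
just an illustration in Python. It assumes the input is already well-formed,
namespace-free XHTML (i.e. the tidy step has run), handles only bold, italic,
strikethrough and paragraphs, and leaves non-ASCII text unescaped. CONTROL,
convert and xhtml_to_rtf are made-up names.

```python
import xml.etree.ElementTree as ET

# RTF control words for a small, assumed subset of XHTML elements.
CONTROL = {"strong": r"\b", "b": r"\b", "em": r"\i", "i": r"\i", "strike": r"\strike"}

def rtf_escape(text):
    # Backslash and braces are RTF syntax and must be escaped in plain text.
    return text.replace("\\", "\\\\").replace("{", r"\{").replace("}", r"\}")

def convert(elem):
    word = CONTROL.get(elem.tag)
    body = rtf_escape(elem.text or "")
    for child in elem:
        body += convert(child) + rtf_escape(child.tail or "")
    if elem.tag == "p":
        return body + r"\par "
    # An RTF group scopes the formatting, so no explicit "off" word is needed.
    return "{" + word + " " + body + "}" if word else body

def xhtml_to_rtf(xhtml):
    root = ET.fromstring(xhtml)
    return r"{\rtf1\ansi " + convert(root) + "}"

print(xhtml_to_rtf("<p>a <strong>b <em>c</em></strong> d</p>"))
```

Because the input tree is properly nested, each element maps onto one RTF
group, which is the main reason to normalise to XHTML first.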
[SNIP]
RTF is multipart, so image binary streams are simply bracketed
appropriately in the file. Tables are always trickier (what is higher in the
hierarchy, columns or rows? The answer depends on the format definition!)
so this promises to be an interesting or at least challenging part of the
project - but alas one I cannot avoid!
Again: Whichever component you find out there (better might have
come along in the past two years), make sure they do not choke
on overlapping styles. Also check how well they cope with certain
commonly used special characters outside the ASCII range.
And color names. And and ...
Overlapping styles won't be allowed, and simply won't be possible through
the user interface. I'll only give them access to the HTML if I can get a
(X)HTML Validator Class for .NET. As to special characters, there is another
specification I need to dig up unless .NET has a UTF object?
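On the special-character side, RTF can carry characters outside ASCII as \uN
escapes, where N is a signed 16-bit decimal UTF-16 code unit followed by a
fallback character. (.NET does ship encoding classes under System.Text, such
as UTF8Encoding.) A minimal sketch in Python, with rtf_escape_text being a
made-up name and "?" an arbitrary choice of fallback:

```python
def rtf_escape_text(text):
    out = []
    for ch in text:
        if ch in "\\{}":
            # Backslash and braces are RTF syntax characters.
            out.append("\\" + ch)
        elif ord(ch) < 128:
            out.append(ch)  # plain ASCII passes through unchanged
        else:
            # Work in UTF-16 code units so characters beyond the BMP
            # come out as a surrogate pair of two \u escapes.
            units = ch.encode("utf-16-le")
            for i in range(0, len(units), 2):
                unit = int.from_bytes(units[i:i + 2], "little")
                if unit > 32767:
                    unit -= 65536  # \u takes a signed 16-bit decimal value
                out.append(f"\\u{unit}?")  # "?" is shown by non-Unicode readers
    return "".join(out)

print(rtf_escape_text("café {€}"))
```

The single "?" fallback assumes the default of one fallback character per
escape (\uc1); a document that sets a different \uc value would need a
matching number of fallback characters.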
There is something appealing about moulding HTML and RTF into a hierarchy of
classes and sub-classes à la W3C, but I'm uncertain of the benefits of such an
approach, other than intimately learning the finer points of classing and
subclassing...?