RTF-parser

  • Thread starter: Christian Jung

Christian Jung

Hello,

does anyone know of an easy way to check whether a string containing RTF
text is empty? By that I mean that the plain text (without the RTF
markup) is empty. I know that I can use a RichTextBox to do that, but I
can't imagine that I need a control, with all its overhead, just to parse
RTF text.

Thanks for any ideas...

Christian Jung
 
Hi,

I think you could strip all RTF tags with a regular expression. As far as I
know, every RTF tag opens with "{" and ends with "}" (you should check the
RTF format docs to be sure of that). Then construct a regexp like this:

\{[^\}]+\}

and replace every match with an empty string.
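
To make the suggestion concrete, here is a small sketch of that replacement in Python (the thread is about .NET, so the language, the helper name strip_groups and the two sample strings are assumptions made for illustration only). It applies the pattern exactly as posted and prints what is left over.

```python
import re

# The pattern from the post: "{", one or more characters that are not "}", then "}".
GROUP_PATTERN = re.compile(r"\{[^\}]+\}")


def strip_groups(rtf: str) -> str:
    """Replace every match of the pattern with an empty string."""
    return GROUP_PATTERN.sub("", rtf)


# Hypothetical sample inputs, not taken from a real document.
flat = r"{\b bold}"
nested = r"{\rtf1{\fonttbl{\f0 Arial;}}\f0 Hello}"

print(repr(strip_groups(flat)))    # ''
print(repr(strip_groups(nested)))  # '}\\f0 Hello}'
```

On the flat sample the whole group disappears, including the word "bold" inside it. On the nested sample the match stops at the first "}", so part of the markup is left behind. Both effects are worth keeping in mind before relying on the result to decide whether the text is empty.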
 
I looked into this a while ago, and couldn't find any simple way to parse
RTF. Dmitriy's solution might work, but there could be some hidden
"gotchas." (Microsoft's RTF specification notes that some applications emit
RTF text in some nonstandard ways, so the parsers need to be rather robust.
Also, Dmitriy's method might fail in the rare circumstance where the
document contains curly braces in the actual text.)

By far the easiest way, as you've mentioned, would just be to use an RTF
control. The overhead wouldn't be that much, since you would only need one
control. Otherwise, you could try to roll your own RTF parser, although
this wouldn't be trivial. There's some sample C++ code in Appendix A of the
RTF specification.

http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dnrtfspec/html/rtfspec.asp

Hope this helps,
Robert Jacobson
 
Actually, if I remember correctly, all tags begin with a backslash (\).
The { and } characters denote group boundaries, such as all the data
belonging to the header group.
The tricky part with RTF parsers is that every time you see a \ you need to
make sure that what follows really is a tag, because \\ is a literal
backslash.
Some tags, like \b (bold), are simple; other tags, like those for
non-standard Unicode characters, are longer and more complex: \uXXXX, where
XXXX is a number.
There's a special tag, something like \*, which means that non-standard
markup is about to be used, and RTF parsers that don't understand the code
should ignore the contents that follow.
The RegEx isn't going to be simple :-)

-Rob [MVP]
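
The points above (backslash tags, group braces, \\ as a literal backslash, \uXXXX characters and the \* marker) can be turned into a small hand-written scanner. What follows is only a rough sketch in Python, not a full RTF parser and not code from the RTF specification. The function name rtf_text_is_empty, the short list of skipped header destinations, and the decision to count whitespace-only text as empty are all assumptions. Binary data (\bin) and many other details are not handled.

```python
import re

# Control word: backslash, ASCII letters, optional signed number,
# and an optional single space that acts as a delimiter.
CONTROL_WORD = re.compile(r"\\([A-Za-z]+)(-?\d+)? ?")

# Assumed, partial list of destinations whose contents are not document text.
SKIPPED_DESTINATIONS = {"fonttbl", "colortbl", "stylesheet", "info"}


def rtf_text_is_empty(rtf: str) -> bool:
    """Return True if no visible (non-whitespace) text is found."""
    depth = 0          # current group nesting depth
    skip_depth = None  # depth of the group currently being ignored, if any
    i = 0
    while i < len(rtf):
        ch = rtf[i]
        if ch == "{":
            depth += 1
            i += 1
        elif ch == "}":
            if skip_depth == depth:
                skip_depth = None  # the ignored group ends here
            depth -= 1
            i += 1
        elif ch == "\\":
            match = CONTROL_WORD.match(rtf, i)
            if match:
                word = match.group(1)
                if skip_depth is None:
                    if word in SKIPPED_DESTINATIONS:
                        skip_depth = depth
                    elif word == "u" and match.group(2):
                        return False  # \uN stands for a Unicode character
                i = match.end()
            else:
                symbol = rtf[i + 1:i + 2]
                if symbol == "*":
                    # \* marks a destination that may be ignored.
                    if skip_depth is None:
                        skip_depth = depth
                elif symbol in ("\\", "{", "}", "'"):
                    # Escaped literal character or \'hh hex character.
                    if skip_depth is None:
                        return False
                i += 2
        else:
            if skip_depth is None and not ch.isspace():
                return False
            i += 1
    return True


# Hypothetical sample inputs.
empty_doc = r"{\rtf1\ansi{\fonttbl{\f0 Arial;}}{\*\generator Example;}\pard\f0\fs17\par}"
print(rtf_text_is_empty(empty_doc))                       # True
print(rtf_text_is_empty(r"{\rtf1\ansi\pard Hello\par}"))  # False
print(rtf_text_is_empty(r"{\rtf1 \\}"))                   # False
```

The scanner stops at the first visible character, so it never builds the plain text. That fits the original question, which only asks whether the text is empty.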

 
Rob and Robert,

Of course I realize that parsing RTF is much more complex than a simple
RegExp. What I meant to offer was a direction to start "digging" in, not a
final solution. Still, I'm glad my posting attracted your criticism; it
should definitely help the original poster avoid the "gotchas" mentioned.

--
Dmitriy Lapshin [C# / .NET MVP]
X-Unity Test Studio
http://x-unity.miik.com.ua/teststudio.aspx
Bring the power of unit testing to VS .NET IDE

 