RTF-parser

Christian Jung · Dec 11, 2003

Hello,

Does anyone know of an easy way to check whether a string that contains
RTF text is empty? By "empty" I mean that the plain text (without the
RTF markers) is empty. I know that I can use a RichTextBox to do that,
but I cannot imagine that I have to use a control with all its overhead
just to parse some RTF text.

Thanks for any idea...

Christian Jung

Dmitriy Lapshin [C# / .NET MVP] · Dec 11, 2003

Hi,

I think you could strip all RTF tags with a regular expression. As far as I
know, every RTF tag opens with "{" and ends with "}" (you should refer to
the RTF format docs to be sure of that). Then construct a regexp like this:

\{[^\}]+\}

and replace every match with an empty string.
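
To make the behaviour of this pattern concrete, here is a minimal sketch that applies it to three small samples. It is written in Python rather than C# purely for illustration; the helper name `strip_groups` and the sample strings are made up for this example and do not come from the thread. The pattern itself is the one quoted above.

```python
import re

# The pattern from the post: "{", one or more characters that are not "}", then "}".
GROUP = re.compile(r"\{[^\}]+\}")

def strip_groups(rtf: str) -> str:
    """Replace every match of the pattern with an empty string."""
    return GROUP.sub("", rtf)

# Hypothetical sample inputs (raw strings, so backslashes are literal).
flat = r"{\b bold}"                                  # one flat group
with_text = r"{\rtf1\ansi Hello}"                    # body text inside the outer group
nested = r"{\rtf1{\fonttbl{\f0 Arial;}}Hello}"       # nested groups, as real RTF has

print(repr(strip_groups(flat)))       # ''
print(repr(strip_groups(with_text)))  # ''  (the visible text "Hello" is lost too)
print(repr(strip_groups(nested)))     # '}Hello}'  (stray closing braces remain)
```

The second and third samples show two separate problems: body text that sits inside a group is removed along with the markup, and nested groups leave unbalanced braces behind. Either one makes the result unreliable as an "is it empty?" test.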

Robert Jacobson · Dec 11, 2003

I looked into this a while ago and couldn't find any simple way to parse
RTF. Dmitriy's solution might work, but there could be some hidden
"gotchas." (Microsoft's RTF specification notes that some applications emit
RTF in nonstandard ways, so parsers need to be rather robust. Also,
Dmitriy's method might fail in the rare circumstance where the document
contains curly braces in the actual text.)

By far the easiest way, as you've mentioned, would just be to use an RTF
control. The overhead wouldn't be that much, since you would only need one
control. Otherwise, you could try to roll your own RTF parser, although
this wouldn't be trivial. There's some sample C++ code in Appendix A of the
RTF specification.

http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dnrtfspec/html/rtfspec.asp

Hope this helps,
Robert Jacobson

Rob Teixeira [MVP] · Dec 12, 2003

Actually, if I remember correctly, all tags begin with \.
The { and } characters denote group boundaries, for example all the data
belonging to the header group.
The tricky part with RTF parsers is that every time you see a \ you need to
make sure that a tag really follows it, because \\ is a literal backslash.
Some tags, like \b (bold), are simple; other tags, like the one for
non-standard Unicode characters, are longer and more complex: \uXXXX, where
XXXX is a number.
There's a special tag, something like \*, which means that non-standard
markup is about to be used, and RTF parsers that don't understand the code
should ignore the contents that follow.
The RegEx isn't going to be simple :-)

-Rob [MVP]
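
The points above (tags start with \, braces delimit groups, \\ is a literal backslash, \uXXXX is a character, \* marks an ignorable group) can be turned into a small scanner that answers only the original question: does the RTF contain any visible text? This is a rough sketch in Python, not a complete RTF parser and not the sample code from the specification. The function name `rtf_has_text` and the short `SKIP_DESTINATIONS` list are assumptions made for this example; a real document may use other header groups that would need to be added.

```python
import re

# Groups whose contents are not body text. This is a partial, assumed list.
SKIP_DESTINATIONS = {"fonttbl", "colortbl", "stylesheet", "info", "pict"}

# A control word: backslash, letters, optional numeric parameter, optional space.
CONTROL_WORD = re.compile(r"\\([a-zA-Z]+)(-?\d+)? ?")

def rtf_has_text(rtf: str) -> bool:
    """Return True if the RTF string appears to contain non-whitespace text."""
    stack = []      # "skip" state saved for each enclosing group
    skip = False    # True while inside a group that holds no body text
    i, n = 0, len(rtf)
    while i < n:
        ch = rtf[i]
        if ch == "{":                       # group start: remember current state
            stack.append(skip)
            i += 1
        elif ch == "}":                     # group end: restore the saved state
            skip = stack.pop() if stack else False
            i += 1
        elif ch == "\\":
            m = CONTROL_WORD.match(rtf, i)
            if m:
                word, param = m.group(1), m.group(2)
                if word in SKIP_DESTINATIONS:
                    skip = True
                elif word == "u" and param is not None and not skip:
                    return True             # \uN encodes a character
                i = m.end()
            elif rtf.startswith("\\*", i):  # ignorable destination marker
                skip = True
                i += 2
            elif rtf.startswith("\\'", i):  # \'hh encodes a character as hex
                if not skip:
                    return True
                i += 4
            else:                           # control symbol such as \\ \{ \}
                if i + 1 < n and rtf[i + 1] in "\\{}" and not skip:
                    return True             # escaped literal character
                i += 2
        else:
            if not skip and not ch.isspace():
                return True                 # ordinary visible character
            i += 1
    return False

print(rtf_has_text(r"{\rtf1\ansi{\fonttbl{\f0 Arial;}}\pard\par}"))  # False
print(rtf_has_text(r"{\rtf1\ansi{\fonttbl{\f0 Arial;}}Hello}"))      # True
```

Two design choices are worth stating: whitespace-only content (including paragraph marks such as \par) is treated as empty, and the scanner stops at the first visible character instead of building the full plain text, which is all the emptiness check needs.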

Dmitriy Lapshin [C# / .NET MVP] · Dec 12, 2003

Rob and Robert,

Of course I realize that parsing RTF is much more complex than a simple
RegExp. What I meant to offer was a direction to start "digging" in, not
a final solution. Still, it's good that my posting attracted your criticism;
this should definitely help the original poster avoid the "gotchas"
mentioned.

--
Dmitriy Lapshin [C# / .NET MVP]
X-Unity Test Studio
http://x-unity.miik.com.ua/teststudio.aspx
Bring the power of unit testing to VS .NET IDE
