html parsing / regular expressions

  • Thread starter: yonido

yonido

hello,

my goal is to extract patterns from email files - say "message
forwarding" patterns (message forwarded from: xx to: yy subject: zz).
now let's say there are tons of these patterns (from gmail, outlook, etc.)
- and i want to create some rules for getting them out of the mail's
html body.

so at first i tried using regular expressions: for example - "any
pattern that starts with a <p> and contains "from:"..." etc.
then i realized that it's not that simple, because different engines
change the content of the html - and i can't expect specific tags (what
if a <p> is added? or a <span>?)

then i was advised to use an html parser, and heard of GOLD and ANTLR,
but i have no clue how they can help.

html parsing sounds better - because what i really care about is the final
SEEN result, and not the STRUCTURE of it.

any light on how to approach this problem would be appreciated.
 
You really don't want to get into the whole HTML-parsing mess. HTML itself
is a mess, and parsing it is quite difficult.

I think you were on the right track with looking for patterns. The HTML tags
enclosing the data are unimportant; the data itself is what matters. So, the
first thing you probably want to do is locate email addresses. There are a
number of well-known patterns for identifying and even parsing email
addresses. Just look for them.
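As a rough illustration of "just look for them", here is a minimal Python sketch. The pattern is a deliberately loose assumption of mine, not a full RFC 5322 validator, and the names EMAIL_RE, find_emails and the sample string are made up for the example.

```python
import re

# Loose, illustrative pattern: local part, "@", domain with at least one dot.
# It will miss some valid addresses and accept some invalid ones.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def find_emails(text):
    """Return every substring of text that looks like an email address."""
    return EMAIL_RE.findall(text)

# Hypothetical fragment of a forwarded-message header inside an HTML body.
sample = ('<p><span>From:</span> Alice &lt;alice@example.com&gt;<br>'
          'To: bob@example.org</p>')

print(find_emails(sample))  # ['alice@example.com', 'bob@example.org']
```

Note that the addresses are found no matter which tags surround them, which is the point: the pattern keys on the data, not on a <p> or a <span>.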

Next, you need to get the context in which these messages appear. For that,
you'll need to figure out the rules, which means that you may need to
separate content from HTML tags. And for that, what you really need to do is
to remove all HTML tags, not parse them. Now, an email address may have
"<" and ">" characters around it, depending on the format (to set the
address apart from a display name that is not part of the address). But
those characters, if they are in the HTML content, will not be those literal
characters, but the HTML encoding for them, i.e. "&lt;" and "&gt;". In the
raw HTML, anything between an actual "<" and ">" will be an HTML tag. So,
you may want to remove all of the tags first, and then look for the data
you're seeking, by figuring out the rules for the patterns that a regular
expression can recognize.
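The steps above (remove the tags, decode the entities, then match on the remaining text) could be sketched in Python roughly as follows. This is only a sketch under assumptions: the tag-stripping regex is naive (it does not handle comments, scripts, or ">" inside attribute values), and FORWARD_RE, html_to_text, find_forward_header and the sample body are hypothetical names and data, covering just one from/to/subject layout.

```python
import html
import re

# Naive tag matcher: anything between a literal "<" and the next ">".
TAG_RE = re.compile(r"<[^>]+>")

def html_to_text(markup):
    """Strip tags, then decode entities, then collapse whitespace."""
    no_tags = TAG_RE.sub(" ", markup)
    # Decode AFTER stripping, so "&lt;alice@example.com&gt;" is not
    # mistaken for a tag and removed.
    text = html.unescape(no_tags)
    return re.sub(r"\s+", " ", text).strip()

# One illustrative rule for a "from / to / subject" forwarding header.
FORWARD_RE = re.compile(
    r"from:\s*(?P<sender>.+?)\s*to:\s*(?P<recipient>.+?)\s*"
    r"subject:\s*(?P<subject>.+)",
    re.IGNORECASE,
)

def find_forward_header(markup):
    """Return the header fields as a dict, or None if the rule does not match."""
    match = FORWARD_RE.search(html_to_text(markup))
    return match.groupdict() if match else None

# Hypothetical body; the mix of tags is arbitrary on purpose.
body = ('<div><p><b>From:</b> Alice &lt;alice@example.com&gt;</p>'
        '<span>To:</span> bob@example.org<br>Subject: <i>Lunch</i></div>')

print(html_to_text(body))
print(find_forward_header(body))
```

Because the rule runs on the text that is left after the tags are gone, adding or removing a <p> or <span> in the source does not change the result. Each mail client's layout would likely need its own rule.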

--
HTH,

Kevin Spencer
Microsoft MVP
Professional Numbskull

The man who questions opinions is wise.
The man who quarrels with facts is a fool.
 