Regex parsing e-mail question.

  • Thread starter: Guest

I am trying to match something like "Keyword: " (the colon and the space act
as a marker). I want everything after that marker, all the way up to the next
"Keyword:", where the keyword HAS to begin a new line. In other words, I want
everything before the next keyword.

(\: ).*(\n[a-zA-Z]) comes extremely close, except for two things.
The most important is that it cannot match when the "content" that I want
spans multiple lines. For example, in an e-mail it would skip over
Received: from unknown (HELO barracuda.domain.com) (127.0.0.1)

by 192.168.2.195 with SMTP; 24 Feb 2005 19:16:52 -0000

Also, is there a way to specify that I want everything "after" the \: and
"before" the \n?

Any help would be greatly appreciated. Below are the sample regexes and the
sample input that I am trying to use. And yes, Google may have bastardized
some of the input.

Regexes that I have tried. The first one has produced the closest results.

(\: ).*(\n[a-zA-Z])
(\: ).*[(\n\s)].*(\n[a-zA-Z])
(\: ).*[(\n\s)].*[^\n[a-zA-Z]]*(\n[a-zA-Z])


----------------------------------------

Input

-------------------------------------------

Return-Path: <[email protected]>
Delivered-To: (e-mail address removed)
Received: (qmail 21118 invoked from network); 16 Mar 2005 20:41:33
-0000
Received: from unknown (HELO barracuda.domains.com) (192.168.192.194)
by 192.168.2.195 with SMTP; 16 Mar 2005 20:41:33 -0000
X-ASG-Debug-ID: 1111005918-25079-3-0
X-Barracuda-URL: http://barracuda.domains.com:8000/cgi-bin/mark.cgi
X-ASG-Whitelist: Sender
X-ASG-Whitelist: Sender
X-ASG-Whitelist: Sender
Received: from domaindev1.domain.local (192-168-1-100.generator.isp.com
[192.168.1.100])
by barracuda.domains.com (Spam Firewall) with ESMTP
id AFE2D20A2F39; Wed, 16 Mar 2005 14:45:18 -0600 (CST)
Received: from tetco634 ([192.168.5.193]) by domaindev1.domain.local
with Microsoft SMTPSVC(6.0.3790.211);
Wed, 16 Mar 2005 14:45:44 -0600
From: "user bleah" <[email protected]>
To: <[email protected]>
Cc: <[email protected]>,
<[email protected]m>
X-ASG-Orig-Subj: New User Signup
Subject: New User Signup
Date: Wed, 16 Mar 2005 14:45:44 -0600
MIME-Version: 1.0
Content-Type: multipart/alternative;
boundary="----=_NextPart_000_00100_01C52A36.D545F720"
X-Mailer: Microsoft Office Outlook, Build 11.0.6353
Thread-Index: AcUqaR/LUDk7Lu7bQdu3vY6SjqLPAQ==
X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2900.2527
Message-ID: <domainDEV1RapPodByDl00000...@domaindev1.domain.local>
X-OriginalArrivalTime: 16 Mar 2005 20:45:44.0462 (UTC)
FILETIME=[1FDFCAE0:01C52A69]
X-Virus-Scanned: by Barracuda Spam Firewall at domains.com
X-Barracuda-Spam-Score: 0.00
X-Barracuda-Spam-Status: No, SCORE=0.00 using global scores of
TAG_LEVEL=3.5 QUARANTINE_LEVEL=1000.0 KILL_LEVEL=6.0


This is a multi-part message in MIME format.


------=_NextPart_000_00100_01C52A36.D545F720
Content-Type: text/plain;
charset="us-ascii"
Content-Transfer-Encoding: 7bit


Content-Transfer-Encoding: 7bit
 
\: .*((\n[^\<\>a-zA-Z\.\@-])|.*).*(\n[a-zA-Z])
I got this to work, except that I need some way to tell it that the \n[^....]
part must be able to repeat multiple times.
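
One way to let the continuation part repeat is to wrap it in a non-capturing group followed by *. The sketch below is an illustration under stated assumptions, not a tested solution for every message: it assumes that continuation lines begin with a space or tab, that header names consist of word characters and hyphens, and that the input contains only the header section. The names HeaderRegexSketch and ParseHeaders are made up for the example. The two capture groups also return the header name and only the text after the "Name: " marker.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class HeaderRegexSketch
{
    // ^([\w-]+):                 header name at the start of a line, then a colon
    // [ \t]*                     optional blanks after the colon
    // ([^\r\n]*                  rest of the first line
    //  (?:\r?\n[ \t]+[^\r\n]*)*) zero or more continuation lines (each starts with a blank)
    private static readonly Regex HeaderPattern = new Regex(
        @"^([\w-]+):[ \t]*([^\r\n]*(?:\r?\n[ \t]+[^\r\n]*)*)",
        RegexOptions.Multiline);

    public static List<KeyValuePair<string, string>> ParseHeaders(string headerBlock)
    {
        var result = new List<KeyValuePair<string, string>>();
        foreach (Match m in HeaderPattern.Matches(headerBlock))
        {
            // Unfold: replace each line break plus its leading blanks with one space.
            string value = Regex.Replace(m.Groups[2].Value, @"\r?\n[ \t]+", " ");
            result.Add(new KeyValuePair<string, string>(m.Groups[1].Value, value));
        }
        return result;
    }
}
```

Because nothing follows the repeated group, the pattern never has to back out of a continuation line it has already consumed, and a line that starts with a letter simply ends the current match.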


 
Hi Recoil,

Have you tried adding a * at the end of the \n[^....] part?
(\n[^\<\>a-zA-Z\.\@-])*

Also, I still do not fully understand your scenario.
Can you post a simple one- or two-line input and the output you want?


Best regards,

Peter Huang
Microsoft Online Partner Support

Get Secure! - www.microsoft.com/security
This posting is provided "AS IS" with no warranties, and confers no rights.
 
If you look at the end of the original message, there is a raw e-mail header.
Basically, I need to be able to parse all of the contents of the e-mail.
Some of the Received: headers are multi-line headers; that is, the actual
contents of that specific header span multiple lines.
I have tried putting a * at various positions; however, the results have
either been a) the same as before or b) the match starts grabbing multiple
headers at the same time.

There is a specific pattern that marks the contents of a header.
Each header word (this identifies what the header is) will be a full word
and will be at the beginning of a line. It will be followed by a colon
and a space.
The contents then follow. If the contents span multiple lines, there will be
whitespace at the beginning of each continuation line. Therefore my regex
needs to start at the beginning.

My pattern, if you take the raw contents of an e-mail message and match
against it, will return an array of headers. Unfortunately, any header that
spans more than 2 lines will not be returned.

You can try testing it here. I have an actual application that I use to test
the regex, but sadly I have not been able to find a way to get it to keep
matching repeated lines that begin with whitespace and then stop at the
first line that does not begin with whitespace.

http://www.regexlib.com/RETester.aspx
Regex of
\: .*((\n[^\<\>a-zA-Z\.\@-])|.*).*(\n[a-zA-Z])
Input of the raw e-mail header posted at the end of the original message.
 
Hi

Based on my understanding, you want to extract all the content after each
"xxxxx:" header name.
Here is some code for your reference.

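// Requires: using System; using System.IO; using System.Text.RegularExpressions;
string mstr;
using (StreamReader sr = new StreamReader(@"..\..\test.txt"))
{
    mstr = sr.ReadToEnd();
}
// Split on every header name ("Name:") that starts a line.
string[] strs = Regex.Split(mstr, @"^[\w-]+:", RegexOptions.Multiline);
using (StreamWriter sw = new StreamWriter(@"..\..\result.txt"))
{
    foreach (string str in strs)
    {
        if (str == "")
            continue;
        // Join the lines of a multi-line value into a single line.
        string s = str.Replace("\r\n", string.Empty);
        sw.WriteLine(s);
        Console.WriteLine(s);
    }
}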

Result:
<[email protected]>
(e-mail address removed)
(qmail 21118 invoked from network); 16 Mar 2005 20:41:33-0000
from unknown (HELO barracuda.domains.com) (192.168.192.194) by
192.168.2.195 with SMTP; 16 Mar 2005 20:41:33 -0000
1111005918-25079-3-0
http://barracuda.domains.com:8?00/cgi-bin/mark.cgi
Sender
Sender
Sender
from domaindev1.domain.local
(192-168-1-100.generator.isp.com[192.168.1.100]) by
barracuda.domains.com (Spam Firewall) with ESMTP id AFE2D20A2F39;
Wed, 16 Mar 2005 14:45:18 -0600 (CST)
from tetco634 ([192.168.5.193]) by domaindev1.domain.local with
Microsoft SMTPSVC(6.0.3790.211); Wed, 16 Mar 2005 14:45:44 -0600
"user bleah" <[email protected]>
<[email protected]>
<[email protected]>
New User Signup
New User Signup
Wed, 16 Mar 2005 14:45:44 -0600
1.0
multipart/alternative;
boundary="----=_NextPart_000_0?100_01C52A36.D545F720"
Microsoft Office Outlook, Build 11.0.6353
AcUqaR/LUDk7Lu7bQdu3vY6SjqLPAQ?=
Produced By Microsoft MimeOLE V6.00.2900.2527
<[email protected]>
16 Mar 2005 20:45:44.0462 (UTC)FILETIME=[1FDFCAE0:01C52A69]
by Barracuda Spam Firewall at domains.com
0.00
No, SCORE=0.00 using global scores ofTAG_LEVEL=3.5 QUARANTINE_LEVEL=1000.0
KILL_LEVEL=6.0This is a multi-part message in MIME
format.------=_NextPart_000_00100_01CD2A36.D545F720
text/plain; charset="us-ascii"
7bit
7bit


Best regards,

Peter Huang
 
I will keep that method in mind. I guess one of the reasons I had not come
across that approach is that it would require me to extract and make a
copy of the e-mail headers only, since it would otherwise split all of the
contents of the e-mail and I only want the e-mail header.

Thanks.
 
Hi

In the Split method, the regex matches the delimiter. To address your
concern, I think we just need to clip the header off from the content first.
If you have any further concerns, please feel free to post here.
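
Clipping the header off before splitting could look like the sketch below. It relies on the convention that the header section ends at the first empty line, and it assumes the message uses either CRLF or bare LF line endings. The names HeaderClipSketch and ExtractHeaderBlock are hypothetical and only serve the example.

```csharp
using System;

public static class HeaderClipSketch
{
    // Returns everything before the first empty line, which by convention
    // separates the header section from the body of an e-mail message.
    public static string ExtractHeaderBlock(string rawMessage)
    {
        int end = rawMessage.IndexOf("\r\n\r\n", StringComparison.Ordinal);
        if (end < 0)
        {
            // Fall back to bare LF line endings.
            end = rawMessage.IndexOf("\n\n", StringComparison.Ordinal);
        }
        // No empty line found: treat the whole input as headers.
        return end < 0 ? rawMessage : rawMessage.Substring(0, end);
    }
}
```

The returned string can then be passed to Regex.Split in place of the full message, so that lines in the body that happen to look like "Name:" are never split.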

Best regards,

Peter Huang
 
I finally solved it. I took a mixture of your idea and my original idea,
then merged and altered them, so I ended up using IndexOf instead of
regex. It turned out to be about 4-10 times faster when parsing over
1k+ e-mails, and for some reason certain input would cause the
Regex to hang at 100% CPU usage indefinitely, which turned out to be a
real show stopper.

Glad for the help.
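
The final code was not posted, so the following is only a guess at what an IndexOf-based approach might look like, not the actual solution. It walks the message line by line, treats lines that start with a space or tab as continuations of the previous header, stops at the first empty line, and uses IndexOf(':') to separate the name from the value. All names (HeaderIndexOfSketch, ParseHeaders) are hypothetical.

```csharp
using System;
using System.Collections.Generic;

public static class HeaderIndexOfSketch
{
    public static List<KeyValuePair<string, string>> ParseHeaders(string rawMessage)
    {
        var headers = new List<KeyValuePair<string, string>>();
        int pos = 0;
        while (pos < rawMessage.Length)
        {
            // Find the end of the current line.
            int eol = rawMessage.IndexOf('\n', pos);
            if (eol < 0) eol = rawMessage.Length;
            string line = rawMessage.Substring(pos, eol - pos).TrimEnd('\r');
            pos = eol + 1;

            if (line.Length == 0)
                break; // an empty line ends the header section

            if (line[0] == ' ' || line[0] == '\t')
            {
                // Continuation line: append it to the previous header, if any.
                if (headers.Count > 0)
                {
                    var last = headers[headers.Count - 1];
                    headers[headers.Count - 1] = new KeyValuePair<string, string>(
                        last.Key, last.Value + " " + line.Trim());
                }
                continue;
            }

            int colon = line.IndexOf(':');
            if (colon <= 0)
                continue; // not a "Name: value" line; skip it

            headers.Add(new KeyValuePair<string, string>(
                line.Substring(0, colon), line.Substring(colon + 1).Trim()));
        }
        return headers;
    }
}
```

A scan like this visits each character a bounded number of times and never backtracks, which is one plausible reason it would avoid the hang described above.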
 