Regex, TextReader...?

Masahiro Ito

I have attached a block of text similar to the type that I am working
with.

I have been learning a lot about Regex - it is quite impressive. I can
easily capture bits of info, but I keep having trouble with line breaks.

I want to identify the start and end of blocks of text. Are there some
tips someone can share?

E.g., in my text, I can grab a collection of everyone's phone numbers with:
^"M:"\t"(?<PhoneNumber>[^"]*)"

But what if I wanted to grab many lines, up to the point where a certain
pattern matches? I use [^"] to say "not a quote", but can I say "not 14
hyphens"?

The way I currently split this type of data is inefficient. I match all the
occurrences of:
^-{14}
Then I do a lot of index arithmetic to split the file at the positions of
the matches. I am sure Regex must have some way to match a complex "not"
pattern that marks the end of my match?

Thank you.




--------------
"M:" "3242310532"
"Subscriber Name:" "MR Regex"
"Additional line user name:" ""
"Sublevel:" " "
"Sublevel:" ""
"Reference 1:" ""
"Reference 2:" ""

"CURRENT CHARGES"
"Monthly Service Plan" $40.00
"Additional Local Airtime" $0.00
"Long Distance Charges" $0.00
"Roaming Charges" $0.00
"Network and Licensing Charges" $7.20
"Total Taxes:" $7.09
"Total Current Charges:" $47.20

"MONTHLY SERVICE PLAN" 11-Oct-03 to 10-Nov-03
"Service Plan Name" "Total"
"Mike Dispatch 40 (11-Oct-03 to 10-Nov-03)" $40.00
"Total Monthly Service Plan Charges" $40.00

"ADDITIONAL LOCAL AIRTIME"
"Service" "Total Mins. Used" "Free Mins. Used" "Included Mins.
Used" "Chargeable Mins. Used" "Total"
"Direct Connect Private (minutes)" 28:04 28:04 0:00 0:00 $0.00
"Total Additional Local Airtime Charges" $0.00

"LONG DISTANCE, ROAMING AND OTHER CALL CHARGES"
"Service" "Incl. LD Minutes" "Chargeable LD Minutes" "Total"
"Total Long Distance Charges" $0.00

"ROAMING"
"Service" "Roaming Minutes" "Roaming Charges" "Roaming LD Minutes"
"Roaming LD Charges" "Roaming Surcharge" "Total"
"Total Roaming Charges" $0.00

"WIRELESS WEB - PREMIUM SERVICE"
"Service" "Total Events" "Event Type" "Total"
"Total Wireless Web Premium Services Charges" $0.00

"PHONE - PREMIUM SERVICE"
"Service" "Total Events" "Event Type" "Total"
"Total Phone Premium Services Charges" $0.00

"PAGER SERVICES"
"Service" "Total Messages" "Included Messages" "Chargeable
Messages" "Total"
"Total Pager Charges" $0.00

"VALUE-ADDED SERVICES" 11-Oct-03 to 10-Nov-03
"Service" "Total"
"Wireless Web - Surf Sampler (11-Oct-03 to 10-Nov-03)" $0.00
"Total Value Added Service Charges" $0.00

"OTHER CHARGES AND CREDIT"
"Charge or Credit" "Total"
"Total Other Charges and Credits" $0.00

"NETWORK and LICENSING CHARGES"
"Service" "Total"
"911 Emergency Access Charge (11-Oct-03 to 10-Nov-03)" $0.25
"System Licensing Charge (11-Oct-03 to 10-Nov-03)" $6.95
"Total Network Licensing Charges" $7.20

"TAXES"
"" "Total"
"Total Taxes" $7.09

--------------
"M:" "9042437121"
"Subscriber Name:" "Fred 1"
"Additional line user name:" ""
"Sublevel:" " "
"Sublevel:" ""
"Reference 1:" ""
"Reference 2:" ""

"CURRENT CHARGES"
 
Yes, you can do it in regex. The trick is to allow your pattern to match
more than one time. For example, if I had something like:

1234
34123
11313
113133
xxxxx

I could write something like:

(?<Numbers>^\d+$)+xxxxx

Which means that I need to look at the Numbers group's Captures collection
(match.Groups["Numbers"].Captures) instead of just the group's Value, IIRC.

Note that in most uses of this technique, what you really need to write is
something like:

((?<Numbers> match numbers) match stuff between numbers)+xxxxx

so that the match can continue. You may also need to play around with the
singleline and multiline options.
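
As written, nothing in the short pattern above consumes the line break after each number, so the group cannot repeat and the match fails. A minimal C# sketch of the same idea with the line break consumed, assuming the input uses \n or \r\n line endings; the names RepeatedCaptureDemo and NumbersBefore are made up for illustration:

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class RepeatedCaptureDemo
{
    // Returns every line of digits that directly precedes the terminator line.
    public static List<string> NumbersBefore(string input, string terminator)
    {
        // (?<Numbers>\d+) captures one line of digits; \r?\n consumes the
        // line break so the outer group can repeat on the next line.
        string pattern = @"^(?:(?<Numbers>\d+)\r?\n)+" + Regex.Escape(terminator);
        Match m = Regex.Match(input, pattern, RegexOptions.Multiline);

        var result = new List<string>();
        if (m.Success)
        {
            // Groups["Numbers"].Value holds only the last repetition;
            // the Captures collection holds all of them.
            foreach (Capture c in m.Groups["Numbers"].Captures)
                result.Add(c.Value);
        }
        return result;
    }

    static void Main()
    {
        string text = "1234\n34123\n11313\n113133\nxxxxx\n";
        foreach (string n in NumbersBefore(text, "xxxxx"))
            Console.WriteLine(n);
    }
}
```

RegexOptions.Multiline is what lets ^ match at the start of each line rather than only at the start of the input.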

--
Eric Gunnerson

Visit the C# product team at http://www.csharp.net
Eric's blog is at http://weblogs.asp.net/ericgu/

This posting is provided "AS IS" with no warranties, and confers no rights.
 
Thank you, Eric. I was already using a capture group (in my first example,
against my sample text, I used (?<PhoneNumber>[^"]*) to capture everything
up to the next " into my PhoneNumber collection).

In this simple example, where I want to capture the Field1 and Field5
values, I cannot reliably regex the 'everything between numbers' part.

My attempt (doesn't work):
Field1:\s(?<F1>[0-9]*)[^Field5:]*Field5:\s(?<F5>[0-9.$]*)
                       ^trouble^

Field1: 1234
Field2: 34123
Field3: 1313
Field4: 13133
Field5: $xxxx.00
Field6: 2342df
Field1: 2342
Field2: 33241
Field3: 2142
Field4: 543523
Field5: $342.00
Field6: 43254
Field1: 3415
Field2: 234235
Field3: 341
Field4: 13212533
Field5: $5234.00
Field6: 32415

Of course, I can run two separate captures, but...

You gave the example technique : ((?<Numbers> match numbers) match stuff
between numbers)+xxxxx

Does this +xxxxx match everything until the xxxxx is found? In my regex
tools (I use Expresso and Regex Workshop, both for .NET) there are no
matches.

Thanks,

Masa
 
I'm a little confused about what you're trying to do. Given the example text
below, what is the output that you expect?

If I assume that you didn't mean to write xxxx.00 for the Field5 value
below, the following regex may do what you want:

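new Regex(@"
(
(?<S2>.*?)
Field1:\s(?<F1>[0-9]*)
(?<S1>.+?)
Field5:\s(?<F5>[0-9.\$]+)
)+",
// Singleline lets . match line breaks, so S1 and S2 can span several lines
RegexOptions.IgnorePatternWhitespace | RegexOptions.Singleline);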

All the F1 values will be in the F1 group's Captures collection, and all the
F5 values in the F5 group's. I named the S1 and S2 captures so you could see
what they're matching.
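
A rough sketch of how those captures could be read back in C#. It assumes RegexOptions.Singleline is set so that . can cross line breaks, and that the first Field5 value is a real amount ($1234.00 is a made-up stand-in for $xxxx.00). FieldCaptureDemo and Extract are illustrative names only:

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class FieldCaptureDemo
{
    // Pairs up the Field1 and Field5 values found by one repeated match.
    public static List<KeyValuePair<string, string>> Extract(string input)
    {
        var regex = new Regex(@"
            (
              (?<S2>.*?)                  # anything before the next Field1
              Field1:\s(?<F1>[0-9]*)
              (?<S1>.+?)                  # anything between Field1 and Field5
              Field5:\s(?<F5>[0-9.\$]+)
            )+",
            RegexOptions.IgnorePatternWhitespace | RegexOptions.Singleline);

        var pairs = new List<KeyValuePair<string, string>>();
        Match m = regex.Match(input);
        if (!m.Success) return pairs;

        // Each repetition of the outer group adds one entry to these collections.
        CaptureCollection f1 = m.Groups["F1"].Captures;
        CaptureCollection f5 = m.Groups["F5"].Captures;
        for (int i = 0; i < f1.Count && i < f5.Count; i++)
            pairs.Add(new KeyValuePair<string, string>(f1[i].Value, f5[i].Value));
        return pairs;
    }

    static void Main()
    {
        string text =
            "Field1: 1234\nField2: 34123\nField5: $1234.00\nField6: 2342df\n" +
            "Field1: 2342\nField2: 33241\nField5: $342.00\nField6: 43254\n";
        foreach (var p in Extract(text))
            Console.WriteLine(p.Key + " -> " + p.Value);
    }
}
```

The lazy quantifiers (.*? and .+?) stop at the nearest Field1 or Field5, which is what a character class such as [^Field5:] cannot express, since a class only excludes single characters.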

I'd suggest using my Regex Workbench at
http://www.gotdotnet.com/Community/UserSamples/Details.aspx?SampleGuid=C712F2DF-B026-4D58-8961-4EE2729D7322 -
it makes playing around with Regex much easier.

--
Eric Gunnerson

 

Thanks Eric. Actually, I was using your Regex Workbench already - it is
great! Thank you for sharing it.

Something is not clicking with me and these regex expressions. Even when I
paste your regex, I don't believe I am getting the responses you intended.
In the sample I posted, I am trying to capture the field 1 and field 5
values. I can capture them separately, but can't seem to grasp the 'skip
everything until a specific pattern is matched'.

I am trying to break down your sample piece by piece. Does the @ at the
start do something?

Also, using Regex Workbench, using your sample in your first reply, I am
not getting any matches.
String:
1234
34123
11313
113133
xxxxx

Regex:
(?<Numbers>^\d+$)+xxxxx

I have tried every permutation I can think of with Multiline/Singleline, etc.
I feel like I am going crazy.

Thank you.

Masa
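
For the "skip everything until a specific pattern is matched" question, one common approach is a negative lookahead: consume one character at a time for as long as the stop pattern does not begin at that position. The sketch below applies it to the original 14-hyphen separators. It is only an illustration, assuming tab-separated fields as in the first post; BlockSplitDemo and Blocks are made-up names. On the @ question: in C#, @ marks a verbatim string literal, so backslashes such as \t reach the regex engine without being doubled, and a literal quote is written as "".

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class BlockSplitDemo
{
    // Returns the text following each separator line, up to the next separator.
    public static List<string> Blocks(string input)
    {
        // ^-{14}\r?\n         a separator line and its line break
        // (?:(?!^-{14}).)*    any character, as long as no separator starts here
        var regex = new Regex(@"^-{14}\r?\n(?<Block>(?:(?!^-{14}).)*)",
            RegexOptions.Multiline | RegexOptions.Singleline);

        var blocks = new List<string>();
        foreach (Match m in regex.Matches(input))
            blocks.Add(m.Groups["Block"].Value);
        return blocks;
    }

    static void Main()
    {
        string sep = new string('-', 14);
        string text =
            sep + "\n\"M:\"\t\"3242310532\"\n\"Subscriber Name:\"\t\"MR Regex\"\n" +
            sep + "\n\"M:\"\t\"9042437121\"\n\"Subscriber Name:\"\t\"Fred 1\"\n";

        // Verbatim string: "" is one literal quote, \t goes straight to the regex.
        var phone = new Regex(@"^""M:""\t""(?<PhoneNumber>[^""]*)""",
            RegexOptions.Multiline);

        foreach (string block in Blocks(text))
            Console.WriteLine(phone.Match(block).Groups["PhoneNumber"].Value);
    }
}
```

Regex.Split on the separator line would be a simpler alternative when only the pieces are needed; the lookahead form is useful when the block has to be captured as part of a larger pattern.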
 