Regex for HTML

  • Thread starter: TJoker .NET

TJoker .NET

Hi all.
I have this database table (inherited from a legacy application) that
contains some information that I want to extract.
Basically, in one of the tables, there's a column containing a description
that starts with a NUMBER, but can be preceded by some raw HTML elements.
Examples:
ex1:
<p>12 this is the first item ....
ex2:
<p>12. this is the first item ....
ex3:
<span id="my id" style="width:3" ><p>12. this is the first item ....
ex4:
12. this is the first item ....

I'm trying to extract the number ("12" in all of the above examples).

The closest I got was when I tried the following regular expression pattern:
string pattern = @"(<\w*>)*(?<digit>(\d+)).+";

It didn't put the number in the right match group (= digit). I'm
still new to Regex.

Has anybody come across a similar situation?

Thanks a bunch

TJ !
 
Hmm, I'd try the following .NET regular expression:

"(<[^>]+>)*(?<digit>\d+)[^\d]"

That is 0 or more tags, where a tag is a '<', followed by at least 1
character that is not a '>', followed by a '>'.

The tags are followed by a run of digits (at least 1), up to but not
including the first non-digit. This could be a problem if it is possible
for the number to be the last thing on the line, because the trailing
[^\d] needs a character to match. It will work if there are always
characters that follow the number.

Mike
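
For anyone trying this out, here is a minimal C# sketch built around the pattern above. It is not the exact pattern from the reply: the ^ anchor, the optional whitespace, and dropping the trailing [^\d] are my own assumptions, and the class and method names (LeadingNumber, Extract) are made up for illustration. Since \d+ is greedy it already consumes the whole run of digits, so the number is still found when it is the last thing on the line. The anchor keeps the match from starting at a digit inside a tag attribute, which is what seems to happen with the original (<\w*>)* pattern on ex3: \w* cannot match the spaces and quotes in the span tag, so the first digit found is the 3 in width:3.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LeadingNumber
{
    // Variant of the pattern suggested in the reply. Assumptions not in the
    // thread: the ^ anchor, the \s* whitespace tolerance, and the removal of
    // the trailing [^\d] so a number at the very end still matches.
    private static readonly Regex Pattern =
        new Regex(@"^\s*(?:<[^>]+>\s*)*(?<digit>\d+)");

    // Returns the leading number as a string, or null when the text does not
    // start with optional tags followed by digits.
    public static string Extract(string description)
    {
        Match m = Pattern.Match(description);
        return m.Success ? m.Groups["digit"].Value : null;
    }
}
```

Calling LeadingNumber.Extract on each of the four examples from the question should give "12", and a description with no leading number gives null. If the column can contain a '>' inside an attribute value, [^>]+ will stop too early, and a real HTML parser would be the safer choice.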
 