A Question About Regular Expressions and Capture

eBob.com

I am using regular expressions and a particular feature called "capture" (I
think) to suck some information out of some html. I could have never come
up with this myself but Balena has an example which is very similar to this.
The guts of the program are ...

Dim i As Integer
Dim rgx As Regex

Dim Pattern As String = "<td class=td1 width=""35%"">(<b>){0,1}(?<variable>(\w| )+)</td>" & _
    "\s*.*\s*<td class=td2 width=""65%"">(<b>){0,1}(?<value>.+)(</b>){0,1}</td>"

' extra parentheses around the value group don't help
Dim Pattern2 As String = "<td class=td1 width=""35%"">(<b>){0,1}(?<variable>(\w| )+)</td>" & _
    "\s*.*\s*<td class=td2 width=""65%"">(<b>){0,1}((?<value>.+))(</b>){0,1}</td>"

rgx = New Regex(Pattern)
tbxPattern.Text = Pattern

Dim m As Match, g As Group

For Each m In rgx.Matches(tbxInput.Text)
    g = m.Groups("variable")
    lstbxKeys.Items.Add(g.Value)
    g = m.Groups("value")
    lstbxValues.Items.Add(g.Value)
Next

The data looks like this (below). It works fine for all cases except the
first (the "Celular" data) where the value is picked up as
"123-abc-5678</b>". I want, and I think it should be, "123-abc-5678". I
can't understand why the "</b>" is included in the value. Doesn't my
pattern clearly show that the value is a string of one or more characters,
terminated by, optionally, "</b>" followed by "</td>"? Is there a
straightforward way to tell it to not include the "</b>" in the value? Note
that the "</b>" is not always present so the pattern has to say that it is
optional.

Thanks, Bob


<tr height=24>
<td class=td1 width="35%"><b>Celular</td>
<td width=1><img src="../img/p.gif" width=1 height=1></td>
<td class=td2 width="65%"><b>123-abc-5678</b></td>
</tr>



<tr height=24>
<td class=td1 width="35%">Edad</td>
<td width=1><img src="../img/p.gif" width=1 height=1></td>
<td class=td2 width="65%">24 Años</td>
</tr>

<tr height=24>
<td class=td1 width="35%">Altura</td>
<td width=1><img src="../img/p.gif" width=1 height=1></td>
<td class=td2 width="65%">1.70 mts.</td>
 
Larry Lard

eBob.com said:
The data looks like this (below). It works fine for all cases except the
first (the "Celular" data) where the value is picked up as
"123-abc-5678</b>". I want, and I think it should be, "123-abc-5678". I
can't understand why the "</b>" is included in the value. Doesn't my
pattern clearly show that the value is a string of one or more characters,
terminated by, optionally, "</b>" followed by "</td>"?

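Yes, but remember that regex quantifiers are 'greedy' by default: they
take as many characters as they can while still letting the overall
pattern match. Thus when given a choice between:

value: 123-abc-5678</b>
optional </b>: no

and

value: 123-abc-5678
optional </b>: yes

the engine settles on the first: since the 'value' match happens first,
and it can legitimately capture the "</b>", it does, which leaves nothing
for the optional "</b>" group.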
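
A small console sketch may make the greedy behaviour easier to see. It uses only the value-cell half of the pattern from the question. The module name and the group name "close" are made up for illustration (the original pattern leaves that group unnamed).

```vb
Imports System
Imports System.Text.RegularExpressions

Module GreedyDemo
    Sub Main()
        ' Value cell from the "Celular" row in the question.
        Dim cell As String = "<td class=td2 width=""65%""><b>123-abc-5678</b></td>"

        ' Same shape as the original: greedy .+ followed by an optional </b>.
        ' "close" is a hypothetical name added so the group can be inspected.
        Dim rgx As New Regex( _
            "<td class=td2 width=""65%"">(<b>){0,1}(?<value>.+)(?<close></b>){0,1}</td>")

        Dim m As Match = rgx.Match(cell)

        ' The greedy .+ backs off only as far as it must for </td> to match,
        ' so it keeps the </b> for itself.
        Console.WriteLine(m.Groups("value").Value)    ' prints 123-abc-5678</b>

        ' The optional group matched zero times.
        Console.WriteLine(m.Groups("close").Success)  ' prints False
    End Sub
End Module
```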

eBob.com said:
Is there a straightforward way to tell it to not include the "</b>" in
the value?

How about, instead of value capturing one or more of any character with

.+

you instead capture one or more characters that aren't < with

[^<]+
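
As a rough sketch of that change, the program below runs the question's pattern with only the value group swapped to [^<]+, against two of the sample rows. The module name is made up, and the rows are joined with Environment.NewLine on the assumption that the real input arrives as multi-line text.

```vb
Imports System
Imports System.Text.RegularExpressions

Module NegatedClassDemo
    Sub Main()
        ' Two rows from the sample data in the question, one cell per line.
        Dim html As String = String.Join(Environment.NewLine, New String() { _
            "<td class=td1 width=""35%""><b>Celular</td>", _
            "<td width=1><img src=""../img/p.gif"" width=1 height=1></td>", _
            "<td class=td2 width=""65%""><b>123-abc-5678</b></td>", _
            "<td class=td1 width=""35%"">Altura</td>", _
            "<td width=1><img src=""../img/p.gif"" width=1 height=1></td>", _
            "<td class=td2 width=""65%"">1.70 mts.</td>"})

        ' Original pattern, except that the value group is [^<]+ instead of .+
        Dim pattern As String = _
            "<td class=td1 width=""35%"">(<b>){0,1}(?<variable>(\w| )+)</td>" & _
            "\s*.*\s*<td class=td2 width=""65%"">(<b>){0,1}(?<value>[^<]+)(</b>){0,1}</td>"

        For Each m As Match In Regex.Matches(html, pattern)
            ' Prints "Celular = 123-abc-5678" and then "Altura = 1.70 mts."
            Console.WriteLine(m.Groups("variable").Value & " = " & m.Groups("value").Value)
        Next
    End Sub
End Module
```

Because [^<] can never consume a "<", the value group has to stop in front of "</b>" (or "</td>" when there is no bold tag), so the optional group gets its chance to match.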

Also, you can make a quantifier non-greedy (lazy) by putting a ? after it,
as in .+?, though I'm not sure how well that will work in this situation.
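
On the non-greedy idea: in .NET this is written as a lazy quantifier (a ? after the quantifier, e.g. .+?) rather than as a flag. A quick check against two of the sample value cells suggests it does give the wanted value here. The module name is made up, and only the value-cell half of the pattern is exercised.

```vb
Imports System
Imports System.Text.RegularExpressions

Module LazyDemo
    Sub Main()
        ' One bold and one plain value cell from the sample data.
        Dim cells() As String = New String() { _
            "<td class=td2 width=""65%""><b>123-abc-5678</b></td>", _
            "<td class=td2 width=""65%"">1.70 mts.</td>"}

        ' Lazy .+? grows one character at a time until the rest of the
        ' pattern (optional </b>, then </td>) can match.
        Dim rgx As New Regex( _
            "<td class=td2 width=""65%"">(<b>){0,1}(?<value>.+?)(</b>){0,1}</td>")

        For Each cell As String In cells
            ' Prints "123-abc-5678" and then "1.70 mts."
            Console.WriteLine(rgx.Match(cell).Groups("value").Value)
        Next
    End Sub
End Module
```

Even so, [^<]+ is probably the tighter choice: a lazy . can still run across tags if several cells ever share a line, while [^<]+ cannot.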

BUT

I would *urge* you to stop trying to parse HTML with regex, and
instead run (don't walk) to
<http://smourier.blogspot.com/2005/05/net-html-agility-pack-how-to-use.html>,
and from there download HtmlAgilityPack, which is an absolutely
invaluable library that converts (even malformed) HTML into a nice XML
document tree. It makes HTML parsing a hundred times easier
than trying to use regex.
 
Thank you very much Larry. It finally occurred to me that there had to be
some way to take advantage of the fact that the string I am after does not
contain "<", but the only solution I could think of was very ugly. Your
suggestion is much, much better. And thank you for making me aware of the
HtmlAgilityPack, I will be looking into it.

Thanks, Bob
