eBob.com
I am using regular expressions and a particular feature called "capture" (I
think) to pull some information out of some HTML. I could never have come
up with this myself, but Balena has an example which is very similar to this.
The guts of the program are:
Dim i As Integer
Dim rgx As Regex
Dim Pattern As String = "<td class=td1 width=""35%"">(<b>){0,1}(?<variable>(\w| )+)</td>" + _
"\s*.*\s*<td class=td2 width=""65%"">(<b>){0,1}(?<value>.+)(</b>){0,1}</td>"
Dim Pattern2 As String = "<td class=td1 width=""35%"">(<b>){0,1}(?<variable>(\w| )+)</td>" + _
"\s*.*\s*<td class=td2 width=""65%"">(<b>){0,1}((?<value>.+))(</b>){0,1}</td>" ' extra parentheses don't help
rgx = New Regex(Pattern)
tbxPattern.Text = Pattern
Dim m As Match, g As Group
For Each m In rgx.Matches(tbxInput.Text)
g = m.Groups("variable")
lstbxKeys.Items.Add(g.Value)
g = m.Groups("value")
lstbxValues.Items.Add(g.Value)
Next
The data looks like this (below). It works fine for all cases except the
first (the "Celular" data), where the value is picked up as
"123-abc-5678</b>". I want, and I think it should be, "123-abc-5678". I
can't understand why the "</b>" is included in the value. Doesn't my
pattern clearly show that the value is a string of one or more characters,
terminated by, optionally, "</b>" followed by "</td>"? Is there a
straightforward way to tell it not to include the "</b>" in the value? Note
that the "</b>" is not always present, so the pattern has to say that it is
optional.
Thanks, Bob
<tr height=24>
<td class=td1 width="35%"><b>Celular</td>
<td width=1><img src="../img/p.gif" width=1 height=1></td>
<td class=td2 width="65%"><b>123-abc-5678</b></td>
</tr>
<tr height=24>
<td class=td1 width="35%">Edad</td>
<td width=1><img src="../img/p.gif" width=1 height=1></td>
<td class=td2 width="65%">24 Años</td>
</tr>
<tr height=24>
<td class=td1 width="35%">Altura</td>
<td width=1><img src="../img/p.gif" width=1 height=1></td>
<td class=td2 width="65%">1.70 mts.</td>
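A likely explanation, offered as a sketch rather than a definitive answer:
".+" is greedy, so it takes as many characters as it can, and because the
trailing "(</b>){0,1}" is allowed to match nothing, the engine has no reason
to give the "</b>" back. Making the quantifier lazy (".+?") should leave the
"</b>" for the optional group. The module below is a hypothetical console
test, not part of the original program. It uses only the second cell of the
"Celular" row, and writes "?" as shorthand for "{0,1}".

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module LazyValueDemo
    Sub Main()
        ' Second cell of the "Celular" row from the sample data.
        Dim row As String = "<td class=td2 width=""65%""><b>123-abc-5678</b></td>"

        ' Greedy: .+ grabs as much as possible, so the optional </b> group matches nothing.
        Dim greedyRx As New Regex("<td class=td2 width=""65%"">(<b>)?(?<value>.+)(</b>)?</td>")

        ' Lazy: .+? grabs as little as possible, leaving </b> to the optional group.
        Dim lazyRx As New Regex("<td class=td2 width=""65%"">(<b>)?(?<value>.+?)(</b>)?</td>")

        Console.WriteLine(greedyRx.Match(row).Groups("value").Value) ' 123-abc-5678</b>
        Console.WriteLine(lazyRx.Match(row).Groups("value").Value)   ' 123-abc-5678
    End Sub
End Module
```

If that holds, the only change needed in Pattern is "(?<value>.+?)" in place
of "(?<value>.+)". Rows without "</b>" should still match, because the lazy
quantifier keeps expanding until "</td>" follows.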