Using regex in html code

Nightcrawler

Hi all.

I have an HTML table with multiple rows (one example row is below). I
would like to extract everything within the <td> tags into groups, row
by row. The process would be: find the first row, extract the column
data, store the data in a text file, then find the next row and do the
same, and so on until every row in the document has been processed.

Please help.

Thanks in advance.

<tr>
<td>1</td>
<td>GET UP </td>
<td>CIARA FT CHAMILLIONAIRE</td>
<td>04:25</td>
<td>128.66</td>
<td></td>
<td>Step Up [Soundtrack]</td>
<td></td>
<td>R&B/Rap</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td>D:\Ciara feat. Chamillionare - Get Up.mp3</td>
<td>Stripe, (-1.6 dB, -0.7 dB)</td>
<td></td>
<td></td>
<td>2006/01/01</td>
<td>256000</td>
<td></td>
<td>2</td>
<td>2007/03/28</td>
<td>2006/12/04</td>
<td>2007/3/28 20:50:16</td>
<td>00:07</td>
<td>B</td>
</tr>
 
* Replying to Nightcrawler's post of 23-5-2007 6:59:

You're better off using the HTML Agility Pack.

But it can be done using regex:

<tr((?!<td).)*(?><td[^>]*>(?<td>((?!</td).)*)</td[^>]*>\s*)*((?!</tr).)*</tr[^>"*]*>
ExplicitCapture ON
SingleLine ON
CaseInsensitive ON

This gives you one match per row, with a single group holding all the
TDs found in that row. I've written it to be fairly robust, but it isn't
the best possible implementation. If the HTML tables are in a well-known
format, it should be no problem. If they come from an external source,
you might want to test more rigorously.

I'll try to explain:
<tr((?!<td).)*
Find a TR start tag and consume everything after it until you reach a
<td.

(?><td[^>]*>(?<td>((?!</td).)*)</td[^>]*>\s*)*
Skip over the opening TD tag and capture its content until you reach a
</td. Then consume the </td> and any whitespace or newlines that follow.
Repeat until every TD in this row has been captured.

((?!</tr).)*</tr[^>"*]*>
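Consume everything that follows the last <td>...</td> pair, up to and
including the closing </tr> tag. With ExplicitCapture on, only the named
group td stores anything; the unnamed groups just match.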

Executing Regex.Matches will give you a MatchCollection with one Match
per row. Each Match has one group named "td" (group names are
case-sensitive). That group's Captures collection contains every value
the group captured in that row, one per cell.
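
Reading the captures could look roughly like the C# sketch below. It assumes the pattern and the three options exactly as posted above. The class name TableRows, the method ExtractRows, and the second sample row are made up for illustration, and it prints to the console rather than writing a text file.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TableRows
{
    // Pattern as posted; "" is an escaped double quote in a verbatim string.
    private const string RowPattern =
        @"<tr((?!<td).)*(?><td[^>]*>(?<td>((?!</td).)*)</td[^>]*>\s*)*((?!</tr).)*</tr[^>""*]*>";

    // Returns one list of cell strings per matched table row.
    public static List<List<string>> ExtractRows(string html)
    {
        RegexOptions options = RegexOptions.ExplicitCapture
                             | RegexOptions.Singleline
                             | RegexOptions.IgnoreCase;
        var rows = new List<List<string>>();

        // One Match per <tr>...</tr>.
        foreach (Match row in Regex.Matches(html, RowPattern, options))
        {
            var cells = new List<string>();
            // Captures holds every value the "td" group matched in this row,
            // not only the last one.
            foreach (Capture cell in row.Groups["td"].Captures)
            {
                cells.Add(cell.Value.Trim());
            }
            rows.Add(cells);
        }
        return rows;
    }

    public static void Main()
    {
        // First row is cut down from the posted sample; second row is hypothetical.
        string html =
            "<table>\n" +
            "<tr>\n<td>1</td>\n<td>GET UP </td>\n<td>04:25</td>\n</tr>\n" +
            "<tr>\n<td>2</td>\n<td>SOME TITLE</td>\n<td>03:10</td>\n</tr>\n" +
            "</table>";

        foreach (List<string> cells in ExtractRows(html))
        {
            // One tab-separated line per table row.
            Console.WriteLine(string.Join("\t", cells));
        }
    }
}
```

To store the rows in a text file instead, the Console.WriteLine call could be swapped for a StreamWriter.WriteLine on a file opened once before the loop.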

Kind Regards,

Jesse Houwing
 
You will need to split the string to do this. It can be done with two
very similar regular expressions:

(?s)<tr[^>]*>(?<content>.*?)</tr>

Splits the table into a match for each row.

Once you have the array of row strings, you can use:

(?s)<td[^>]*>(?<content>.*?)</td>

Splits the row into a match for each column.
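
The two passes could be wired together roughly like this in C#. This is only a sketch: the two patterns are the ones given above, while the class name TwoPassTable, the method ExtractRows, and the sample HTML are invented for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TwoPassTable
{
    // The two patterns from above; (?s) lets . match newlines.
    private static readonly Regex RowRegex =
        new Regex(@"(?s)<tr[^>]*>(?<content>.*?)</tr>");
    private static readonly Regex CellRegex =
        new Regex(@"(?s)<td[^>]*>(?<content>.*?)</td>");

    // Returns one string array of cell contents per table row.
    public static List<string[]> ExtractRows(string html)
    {
        var rows = new List<string[]>();

        // Pass 1: one match per <tr>...</tr>.
        foreach (Match row in RowRegex.Matches(html))
        {
            // Pass 2: one match per <td>...</td> inside that row's content.
            MatchCollection cellMatches =
                CellRegex.Matches(row.Groups["content"].Value);

            var cells = new string[cellMatches.Count];
            for (int i = 0; i < cellMatches.Count; i++)
            {
                cells[i] = cellMatches[i].Groups["content"].Value.Trim();
            }
            rows.Add(cells);
        }
        return rows;
    }

    public static void Main()
    {
        // Hypothetical two-row sample in the shape of the posted row.
        string html =
            "<tr>\n<td>1</td>\n<td>GET UP </td>\n<td></td>\n</tr>\n" +
            "<tr>\n<td>2</td>\n<td>SOME TITLE</td>\n<td>B</td>\n</tr>";

        foreach (string[] cells in ExtractRows(html))
        {
            // One tab-separated line per table row; empty cells stay empty.
            Console.WriteLine(string.Join("\t", cells));
        }
    }
}
```

As written the patterns are case-sensitive; passing RegexOptions.IgnoreCase to both Regex constructors would also match <TR> and <TD>.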

The reason it can't be done in one pass is that you need to create a match
for each row, and the match cannot contain "sub-matches," only groups, and
unless you know how many columns there are, you can't create a group for
each column. If you DO know how many columns there are, you can, as in:

(?s)<tr[^>]*>.*?(?<row1><td[^>]*>(?<row1content>.*?)</td>).*?(?<row2><td[^>]*>(?<row2content>.*?)</td>).*?</tr>

--
HTH,

Kevin Spencer
Microsoft MVP

Printing Components, Email Components,
FTP Client Classes, Enhanced Data Controls, much more.
DSI PrintManager, Miradyne Component Libraries:
http://www.miradyne.net
 
Kevin,

You actually can get multiple results for the same named group. The
structure is as follows:

MatchCollection 1 ----> * Groups 1 ----> * Captures

Which - sort of - translates to:

Rows ----> * Cells ----> * Cell Values

The expression which will capture this info correctly would then be
something like this:

<tr((?!<td).)*(?><td[^>]*>(?<td>((?!</td).)*)</td[^>]*>\s*)*((?!</tr).)*</tr[^>"*]*>
ExplicitCapture ON
SingleLine ON
CaseInsensitive ON

I tested it and it works like a charm.

Kind regards,

Jesse Houwing
 
I've got to hand it to you, Jesse. That is possibly the most creative use
I've ever seen of regular expressions and the System.Text.RegularExpressions
namespace and classes. I tested it too, and while it took me a good while to
get my head around what it was doing, and I will have to mull it over some
more before I fully understand it, it does work beautifully. I'd love to see
some more of your regex work some time.

--
HTH,

Kevin Spencer
Microsoft MVP

 
* Replying to Kevin Spencer's post of 24-5-2007 13:48:

Kevin,

Thank you :).

Jesse
 